Abstract
A large fraction of the protein crystal structures deposited in the Protein Data Bank are incomplete: the positions of one or more residues are not reported, even though these residues are part of the material that was analyzed. This may bias the use of protein crystal structures by molecular biologists. Here we observe that strings of residues are missing in the large majority of protein crystal structures. Polar residues and glycine tend to occur in missing strings, while apolar and aromatic residues tend to avoid them. The residues that flank the missing strings are particularly flexible, as shown by their extremely high B-factors, their exposure to the solvent, and their secondary structures. These data should be a helpful guideline for crystallographers who encounter regions of flat and uninterpretable electron density, as well as for end-users of crystal structures.
Keywords: conformational disorder, disordered proteins, missing atoms, missing residues, protein crystal structure, protein data bank, structural bioinformatics, unfolded proteins
Introduction
Over a hundred thousand macromolecular crystal structures have been determined and deposited in the Protein Data Bank (PDB)1,2 during the last few decades. This impressive amount of information allows the design of powerful knowledge-based tools in many fields of biochemistry and biophysics.3 For this reason it is essential that the quality of the data distributed by the PDB is extremely high.4
An aspect that is often ignored by the users of the PDB data is that a protein structure may be incomplete: the positions of one or more atoms or residues are not declared, though these atoms are present in the material that was analyzed.
Although this is not a serious problem for crystallographers, who generally attribute the absence of electron density to excessive conformational disorder, it is a major problem for the end-users of the PDB data. For example, the presence of gaps along the sequence may cause a misinterpretation of electrostatic potentials at the protein surface. Docking simulations might be biased by the absence of some protein moiety. Statistical trends mined from the PDB, especially those concerning the protein surface, might be inaccurate if the absence of some residues is not properly handled.
It is therefore essential to analyze and consider the occurrence of these “missing” residues and strings of residues. Here we report a statistical survey of missing strings of residues in the PDB and of the residues that flank these strings.
We observe that missing strings are extremely frequent: more than 80% of the structures refined at a resolution worse than 1.75 Å contain at least one missing string, as do about 20% of the structures refined at a resolution better than 0.75 Å. We report the statistical propensities of the amino acids to occur in these strings and several statistical descriptions of the residues that precede or follow them.
Results and Discussion
Definition of missing string
We classified the strings of missing residues as N-terminal, when they occur at the beginning of the polypeptide chain, as C-terminal, when they are at the end of the polypeptide chain, or as internal, in the other cases (see Fig. 1). Moreover, each string was associated with its length, measured by the number of residues that it contains.
Figure 1.

Definition of string of missing residues.
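The classification above can be illustrated with a minimal Python sketch. It assumes that each chain is represented as a list of booleans, one per residue of the crystallized sequence, with `True` meaning that coordinates were deposited; this representation and the function name `find_missing_strings` are illustrative choices, not part of the original analysis.

```python
from typing import List, Tuple


def find_missing_strings(observed: List[bool]) -> List[Tuple[str, int, int]]:
    """Return (kind, start_index, length) for each run of missing residues.

    observed[i] is True when residue i has coordinates, False when it is missing.
    Assumption: a chain that is missing entirely is reported as one N-terminal string.
    """
    strings = []
    n = len(observed)
    i = 0
    while i < n:
        if observed[i]:
            i += 1
            continue
        start = i
        # Extend the run until the next observed residue or the chain end.
        while i < n and not observed[i]:
            i += 1
        length = i - start
        if start == 0:
            kind = "N-terminal"
        elif i == n:
            kind = "C-terminal"
        else:
            kind = "internal"
        strings.append((kind, start, length))
    return strings
```

For a chain given as `[False, False, True, True, False, True, False, False, False]`, the function reports an N-terminal string of length 2, an internal string of length 1, and a C-terminal string of length 3.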
Frequency of missing strings
Missing strings are frequent in protein structures: on average, 69% of the PDB files have missing strings. This percentage varies with the crystallographic resolution (see Table 1). At very high resolution, better than 0.75 Å, only about one fifth of the PDB files contain one or more missing strings. This percentage increases and reaches a value close to 80% when the resolution extends to 2.0 Å or worse. Moreover, on average, less than 1% of the residues are missing in structures refined at a resolution better than 0.75 Å. This percentage increases as the resolution worsens. On average, about 10% of the residues are missing at very low resolution, worse than 3.25 Å.
Table 1.
Frequency of missing strings.
| Resolution (Å) | % of PDB files with at least one missing string | % of residues that are missing (standard errors in parentheses) |
|---|---|---|
| < 0.75 | 21 | 0.4(0.3) |
| 0.75-1.00 | 40 | 1.8(0.3) |
| 1.00-1.25 | 62 | 3.4(0.2) |
| 1.25-1.50 | 73 | 4.9(0.1) |
| 1.50-1.75 | 78 | 5.5(0.1) |
| 1.75-2.00 | 81 | 6.3(0.1) |
| 2.00-2.25 | 83 | 7.0(0.1) |
| 2.25-2.50 | 83 | 7.5(0.1) |
| 2.50-2.75 | 83 | 7.6(0.1) |
| 2.75-3.00 | 84 | 8.1(0.2) |
| 3.00-3.25 | 83 | 8.5(0.2) |
| 3.25-3.50 | 81 | 9.0(0.3) |
| > 3.50 | 83 | 9.4(0.4) |
Considering only the structures that have at least one missing string, it appears that at high resolution there are few missing strings per structure: only one at a resolution better than 0.75 Å and more than 3 at a resolution worse than 2.75 Å (see Table 2). At high resolution, missing strings are also shorter: while they contain about 4 residues at a resolution better than 0.75 Å, they contain, on average, more than 13 residues at a resolution worse than 2.75 Å. Low resolution structures therefore have more and longer missing strings.
Table 2.
Average number of missing strings per PDB file and average number of residues per string (standard errors in parentheses).
| Resolution (Å) | Average number of missing strings per PDB file | Average number of residues per string |
|---|---|---|
| < 0.75 | 1.00(0.00) | 4.5(1.5) |
| 0.75-1.00 | 1.41(0.06) | 5.4(0.8) |
| 1.00-1.25 | 1.66(0.06) | 7.2(0.4) |
| 1.25-1.50 | 1.92(0.11) | 8.1(0.3) |
| 1.50-1.75 | 1.95(0.06) | 9.3(0.4) |
| 1.75-2.00 | 2.15(0.05) | 10.0(0.4) |
| 2.00-2.25 | 2.35(0.05) | 11.3(0.5) |
| 2.25-2.50 | 2.85(0.10) | 11.7(0.5) |
| 2.50-2.75 | 2.89(0.11) | 11.7(0.5) |
| 2.75-3.00 | 3.88(0.15) | 13.9(1.0) |
| 3.00-3.25 | 3.85(0.15) | 15.0(1.1) |
| 3.25-3.50 | 4.15(0.22) | 14.9(1.2) |
| > 3.50 | 4.09(0.21) | 15.6(1.5) |
Although these trends cannot be considered unexpected, their interpretation might be ambivalent. On the one hand, one might suppose that at low resolution, the paucity of experimental data may hinder the determination of some structure moieties. On the other hand, it is also possible that the presence of too many disordered regions hinders the collection of high-resolution diffraction data. The two hypotheses are not mutually exclusive.
Importantly, we did not observe any relationship between the presence of missing strings and the crystal system, the space group, or the occurrence of specific types of symmetry elements.
Le Gall and colleagues also observed, a few years ago, the high proportion of Protein Data Bank structures with incomplete atom coordinates: in their data set, only 7% of the protein structures have a complete list of atom coordinates.5 A further publication by Mohan and coworkers, also based on the analysis of Protein Data Bank structures, showed that disordered protein regions are sensitive to changes in amino acid sequence and to experimental conditions of crystallogenesis.6
Amino acid propensity of missing strings
We computed the intrinsic propensity of each amino acid to occur in a missing string. The propensity paa of an amino acid aa is

$$p_{aa} = \frac{n_{aa,\mathrm{string}} / n_{x,\mathrm{string}}}{n_{aa} / n_{x}}$$

where naa,string is the number of aa that occur in a missing string, nx,string is the number of amino acids that occur in missing strings, naa is the total number of aa, and nx is the total number of residues. A paa value higher than one indicates that the residue aa tends to be in a missing string, while a paa value lower than one indicates that the residue aa tends to avoid missing strings. A paa value equal to one indicates that the residue aa neither prefers nor avoids missing strings.7
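As an illustration of the propensity defined above, the following Python sketch computes paa for every amino acid from two lists of residue names. The function name `propensities` and the input representation are assumptions made for this example; the list of all residues is assumed to include the missing ones.

```python
from collections import Counter
from typing import Dict, Iterable


def propensities(all_residues: Iterable[str],
                 missing_residues: Iterable[str]) -> Dict[str, float]:
    """Propensity of each amino acid to occur in a missing string.

    all_residues: every residue of the crystallized sequences (missing ones included).
    missing_residues: only the residues that belong to missing strings.
    """
    total = Counter(all_residues)       # n_aa for each amino acid
    missing = Counter(missing_residues)  # n_aa,string for each amino acid
    n_x = sum(total.values())            # total number of residues
    n_x_string = sum(missing.values())   # number of residues in missing strings
    if n_x == 0 or n_x_string == 0:
        raise ValueError("both residue lists must be non-empty")
    # Ratio of the frequency inside missing strings to the overall frequency.
    return {aa: (missing[aa] / n_x_string) / (total[aa] / n_x) for aa in total}
```

With a toy sequence of two glycines, two alanines, and four leucines, of which both glycines and one alanine are missing, glycine obtains a propensity of about 2.67, alanine about 1.33, and leucine 0.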
We computed 4 propensities for each amino acid: the propensity to be in any type of string (pany), the propensity to be in an internal string (pi), the propensity to be in an N-terminal string (pN), and the propensity to be in a C-terminal string (pC). Moreover, we computed these propensities for missing strings of different length: fewer than 4 residues, 4-6 residues, 7-15 residues, and more than 15 residues. Table 3 shows all these propensities. We verified that the resolution of the crystal structures has a minor influence on these values (data not shown); the data of Table 3 therefore refer to all the structures, independently of their resolution.
Table 3.
Amino acid propensities to be in a missing string.
| | < 4 residues | | | | 4-6 residues | | | | 7-15 residues | | | | > 15 residues | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aa | pany | pi | pN | pC | pany | pi | pN | pC | pany | pi | pN | pC | pany | pi | pN | pC |
| Gly | 1.88 | 2.32 | 1.88 | 1.45 | 1.81 | 2.13 | 1.78 | 1.23 | 1.29 | 1.64 | 1.26 | 1.47 | 1.25 | 1.52 | 1.24 | 1.09 |
| Ala | 0.86 | 0.86 | 0.83 | 0.87 | 1.07 | 0.98 | 1.10 | 1.19 | 0.87 | 1.03 | 0.82 | 1.21 | 0.96 | 1.06 | 0.86 | 1.04 |
| Ser | 1.66 | 1.47 | 1.86 | 1.50 | 1.82 | 1.70 | 2.06 | 1.71 | 1.44 | 1.65 | 1.68 | 1.60 | 1.55 | 1.64 | 1.66 | 1.39 |
| Thr | 0.68 | 0.95 | 0.49 | 0.71 | 1.04 | 1.17 | 0.78 | 1.18 | 0.86 | 1.20 | 0.71 | 0.95 | 0.96 | 1.00 | 0.88 | 1.03 |
| Cys | 0.26 | 0.35 | 0.14 | 0.38 | 0.39 | 0.48 | 0.20 | 0.51 | 0.33 | 0.43 | 0.22 | 0.48 | 0.51 | 0.38 | 0.48 | 0.60 |
| Val | 0.33 | 0.35 | 0.23 | 0.46 | 0.50 | 0.52 | 0.38 | 0.67 | 0.52 | 0.64 | 0.44 | 0.67 | 0.70 | 0.71 | 0.67 | 0.70 |
| Leu | 0.33 | 0.41 | 0.17 | 0.52 | 0.43 | 0.42 | 0.35 | 0.55 | 0.48 | 0.56 | 0.32 | 0.86 | 0.72 | 0.70 | 0.68 | 0.74 |
| Ile | 0.28 | 0.42 | 0.12 | 0.39 | 0.42 | 0.40 | 0.31 | 0.62 | 0.45 | 0.60 | 0.40 | 0.52 | 0.58 | 0.66 | 0.51 | 0.60 |
| Met | 5.02 | 0.69 | 10.19 | 0.59 | 2.48 | 0.59 | 5.87 | 0.65 | 1.33 | 0.81 | 2.93 | 0.52 | 1.43 | 0.96 | 1.94 | 0.78 |
| Pro | 0.82 | 1.11 | 0.45 | 1.16 | 1.00 | 1.08 | 0.84 | 1.07 | 1.07 | 1.33 | 0.99 | 1.40 | 1.15 | 1.33 | 1.01 | 1.28 |
| Phe | 0.19 | 0.35 | 0.04 | 0.27 | 0.39 | 0.50 | 0.17 | 0.52 | 0.38 | 0.51 | 0.29 | 0.48 | 0.59 | 0.65 | 0.57 | 0.54 |
| Tyr | 0.26 | 0.45 | 0.09 | 0.36 | 0.35 | 0.40 | 0.20 | 0.47 | 0.39 | 0.57 | 0.24 | 0.51 | 0.53 | 0.64 | 0.47 | 0.50 |
| Trp | 0.19 | 0.36 | 0.05 | 0.26 | 0.29 | 0.34 | 0.13 | 0.42 | 0.29 | 0.41 | 0.24 | 0.29 | 0.46 | 0.50 | 0.38 | 0.55 |
| Asp | 0.83 | 1.45 | 0.41 | 0.91 | 1.13 | 1.48 | 0.66 | 1.22 | 0.93 | 1.40 | 0.71 | 0.92 | 0.89 | 1.20 | 0.70 | 1.00 |
| Glu | 0.92 | 1.48 | 0.46 | 1.11 | 1.20 | 1.36 | 0.77 | 1.58 | 1.09 | 1.33 | 0.71 | 1.98 | 1.05 | 1.32 | 0.87 | 1.17 |
| Asn | 1.26 | 1.78 | 0.90 | 1.32 | 1.34 | 1.65 | 0.94 | 1.42 | 0.92 | 1.33 | 0.76 | 0.93 | 0.94 | 1.25 | 0.81 | 0.93 |
| Gln | 0.97 | 1.34 | 0.49 | 1.42 | 0.99 | 1.21 | 0.63 | 1.13 | 0.90 | 1.15 | 0.73 | 1.25 | 1.07 | 1.14 | 1.01 | 1.10 |
| His | 1.42 | 1.25 | 0.68 | 2.91 | 1.25 | 1.08 | 1.60 | 1.01 | 4.69 | 0.99 | 5.31 | 1.41 | 2.68 | 0.96 | 3.64 | 2.69 |
| Lys | 1.06 | 1.55 | 0.28 | 1.90 | 1.23 | 1.49 | 0.68 | 1.64 | 0.97 | 1.32 | 0.74 | 1.27 | 1.02 | 1.26 | 0.85 | 1.14 |
| Arg | 0.79 | 1.07 | 0.24 | 1.47 | 0.90 | 1.01 | 0.56 | 1.21 | 0.81 | 1.09 | 0.51 | 1.27 | 0.96 | 1.13 | 0.79 | 1.13 |
Because we focus on the occurrence of missing residues in crystal structures rather than on biological questions, we examined the amino acid sequence of the proteins that were crystallized and we did not eliminate artificial tags (for example poly-histidine tags) that may be present at the N- and C-termini of the polypeptide chains.
Since N- and C-termini are often conformationally quite mobile, it is thus not surprising that histidine, which is often introduced for affinity purification purposes, shows a considerable propensity to occur in missing strings. However, histidine is quite common also in internal missing strings and not only in N- and C-terminal strings.
Analogously, methionine, which is often the first residue in the protein sequence, has a particularly high tendency to occur in N-terminal missing strings. Contrary to histidine, methionine shows a very modest propensity to be in internal missing strings, especially when they are short.
The two residues with extreme and opposite flexibilities, proline and glycine, present an interesting behavior. The flexible glycine has a high propensity to be part of missing strings, especially if they are short and internal. The rigid proline has a pronounced tendency to occur in missing strings too, especially if they are long and internal or C-terminal. It is reasonable to argue that the presence of a proline confers some rigidity to the missing strings of residues. However, this rigidity may not be associated with a firm anchoring of the string to the rest of the protein, which is well folded and compact; as a consequence, the proline containing string might oscillate considerably as a partially rigid fragment.
Apolar and aromatic residues do not tend to occur in missing strings. This seems to be a consequence of the fact that missing strings tend to map to the protein surface and are well exposed to the solvent, while apolar residues are seldom present in these regions. On the contrary, polar residues are frequent in missing strings, especially serine, which is the smallest, while threonine and glutamine seem reluctant to be there. Charged residues have a remarkable tendency to be in internal and C-terminal missing strings, while N-terminal strings seldom include them. Although we do not have a conclusive explanation, we note that this is most noticeable for lysine and arginine in short N-terminal strings, where 2 cationic charges would be brought close together: that of the amino-terminus of the sequence and that of the lysine or arginine side-chain.
In essence, it seems that the propensity to be in a missing string strongly depends on the polarity of the residues, with the exception of methionine and histidine that are frequent in N- and/or C-terminal missing segments. In fact, the relationship between residue hydrophobicity and residue propensity to be in a missing string is quite strong (see Figure 2; y = 0.75 – 0.32 x; correlation coefficient = 0.672).
Figure 2.

Relationship between amino acid hydrophobicity and propensity to occur in a missing string (of any length and any type). Similar plots are obtained by considering only certain types of missing strings (internal, N-terminal, or C-terminal; containing 1-3 residues, 4-6 residues, 7-15 residues, or more than 15 residues). The hydrophobicity values were taken from ref. 19.
Interestingly, Radivojac and colleagues showed that the amino acid composition of the missing strings of residues, especially if they are short, is similar to that of the protein segments characterized by large B-factor values and different from that of the protein segments with low B-factor values.8
The amino acid propensities to be in a missing string correlate quite well with the TOP-IDP amino acid scale that discriminates between order and disorder,9 even though the latter is based not only on the Protein Data Bank but also on other data (for example, circular dichroism or protease digestion). Apart from methionine, which is extremely frequent in the N-terminal missing strings despite its small TOP-IDP value, discrepancies are also observed for serine, glycine, asparagine, and histidine. The frequency of these residues in missing strings is higher than expected from their TOP-IDP values. However, the relationship between the propensities of the amino acids to be in a missing string and the TOP-IDP values does not seem to be linear. For example, a correlation coefficient equal to 0.74 is associated with the relationship y = 0.61 e^(1.11x) (y = propensity to be in any type of missing string shorter than 4 amino acids and x = TOP-IDP value; all residues except methionine).
B-factors around missing strings
B-factors monitor the amplitude of the oscillation of an atom around its equilibrium position. However, their value also increases if the atom is conformationally disordered, and larger values are usually observed in loops than in helices and strands.10 Because B-factors may vary from one structure to the next for reasons other than genuine physical effects,11-13 their values must be normalized to zero mean and unit variance as

$$B_N = \frac{B - B_{ave}}{B_{std}}$$

where B is the B-factor, BN is the normalized B-factor, Bave is the average B-factor of the n atoms, and Bstd is the standard deviation of their B-factors. Table 4 shows the values of the normalized B-factors of the 3 residues that precede or follow a missing string. Since the BN values may depend on the type of residue and vary if the side-chain varies, we considered only the BN values of the Calpha atoms.
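The normalization to zero mean and unit variance can be sketched in Python as follows. The text does not state whether Bstd is the population or the sample standard deviation; this example assumes the population form (`statistics.pstdev`), and the function name `normalize_b_factors` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def normalize_b_factors(b_factors: Sequence[float]) -> List[float]:
    """Rescale the B-factors of one structure to zero mean and unit variance."""
    b_ave = mean(b_factors)
    # Assumption: population standard deviation; the sample form would use stdev().
    b_std = pstdev(b_factors)
    if b_std == 0:
        raise ValueError("all B-factors are identical; normalization is undefined")
    return [(b - b_ave) / b_std for b in b_factors]
```

Applied to the Calpha B-factors of a single structure, the returned values are directly comparable with those of other structures, which is the purpose of the normalization.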
Table 4.
Average values of the normalized B-factors of the 3 residues that precede or follow a missing string. Only the B-factors of the Calpha atoms are considered. Missing strings of any length are considered. Standard errors are in parentheses.
| Resolution (Å) | Position −3 | Position −2 | Position −1 | Position +1 | Position +2 | Position +3 |
|---|---|---|---|---|---|---|
| Internal missing string | ||||||
| any | 1.23(0.03) | 1.79(0.03) | 2.36(0.03) | 2.25(0.03) | 1.76(0.03) | 1.22(0.03) |
| 0.0-1.5 | 1.02(0.08) | 1.86(0.09) | 2.96(0.11) | 2.94(0.10) | 1.86(0.09) | 1.05(0.07) |
| 1.5-2.0 | 1.16(0.04) | 1.82(0.04) | 2.57(0.04) | 2.44(0.04) | 1.77(0.04) | 1.12(0.04) |
| 2.0-2.5 | 1.20(0.04) | 1.71(0.04) | 2.19(0.04) | 2.10(0.04) | 1.70(0.04) | 1.21(0.04) |
| >2.5 | 1.48(0.09) | 1.84(0.09) | 2.07(0.09) | 1.93(0.09) | 1.78(0.08) | 1.44(0.08) |
| C-terminal missing string | ||||||
| any | 1.14(0.03) | 1.78(0.03) | 2.61(0.03) | — | — | — |
| 0.0-1.5 | 1.01(0.07) | 1.74(0.07) | 3.14(0.09) | — | — | — |
| 1.5-2.0 | 1.13(0.04) | 1.86(0.04) | 2.77(0.05) | — | — | — |
| 2.0-2.5 | 1.25(0.04) | 1.81(0.05) | 2.44(0.05) | — | — | — |
| >2.5 | 1.05(0.08) | 1.51(0.09) | 1.88(0.09) | — | — | — |
| N-terminal missing string | ||||||
| any | — | — | — | 2.51(0.03) | 1.69(0.02) | 1.01(0.02) |
| 0.0-1.5 | — | — | — | 3.33(0.08) | 1.91(0.07) | 1.01(0.06) |
| 1.5-2.0 | — | — | — | 2.63(0.04) | 1.76(0.04) | 1.01(0.03) |
| 2.0-2.5 | — | — | — | 2.10(0.05) | 1.53(0.04) | 0.95(0.04) |
| >2.5 | — | — | — | 1.80(0.09) | 1.49(0.09) | 1.09(0.09) |
It is apparent that the residues that are closer to the missing strings have higher BN values. For example, the first residue that precedes an internal missing string has, on average, a BN value of 2.4, the second of 1.8, and the third of 1.2; analogously, the first residue that follows an internal missing string has, on average, a BN value of 2.3, the second of 1.8, and the third of 1.2. The BN values of the residues that precede an internal missing string are nearly identical to the BN values of the residues that follow an internal missing string.
Similar trends are observed for N-terminal and C-terminal missing strings and when the various crystallographic resolution ranges are considered separately. The only remarkable difference is that the BN values vary more at high resolution, where the difference between the highest and the lowest BN values is larger, than at low resolution.
The residues flanking a missing string have BN values that are among the highest BN values observed in the crystal structure (see Table 5). For example, only 8% of the residues have a BN value higher than the first residue that precedes an internal missing string and 9% of the residues have a BN value higher than the first residue after an internal missing string. These percentages tend to increase as one moves away from the missing string, since the BN values tend to decrease in the same direction. Similar trends are shown by N-terminal and C-terminal missing strings. Furthermore, the percentage of residues with higher BN values than the residues that are just before or after a missing string tends to increase as the resolution worsens.
Table 5.
Percentage of residues that have a BN larger than the BNs of the 3 residues that precede or follow a missing string. Only the B-factors of the Calpha atoms are considered. Missing strings of any length are considered. Standard errors are in parentheses.
| Resolution (Å) | Position −3 | Position −2 | Position −1 | Position +1 | Position +2 | Position +3 |
|---|---|---|---|---|---|---|
| Internal missing strings | ||||||
| any | 21.35(0.38) | 13.30(0.29) | 8.40(0.24) | 9.26(0.25) | 13.86(0.30) | 21.62(0.39) |
| 0.0-1.5 | 23.81(1.49) | 12.44(1.04) | 5.79(0.83) | 5.24(0.71) | 11.57(0.93) | 21.26(1.25) |
| 1.5-2.0 | 20.96(0.63) | 11.68(0.43) | 5.64(0.26) | 6.63(0.30) | 12.25(0.44) | 21.88(0.65) |
| 2.0-2.5 | 20.28(0.60) | 12.96(0.46) | 8.50(0.38) | 8.76(0.35) | 13.25(0.48) | 20.32(0.62) |
| >2.5 | 22.87(0.88) | 16.82(0.75) | 13.69(0.69) | 15.86(0.75) | 18.37(0.81) | 23.55(0.91) |
| C-terminal missing strings | ||||||
| any | 21.06(0.42) | 12.63(0.31) | 6.88(0.24) | — | — | — |
| 0.0-1.5 | 21.54(1.01) | 11.64(0.72) | 4.23(0.36) | — | — | — |
| 1.5-2.0 | 20.91(0.63) | 12.04(0.46) | 5.93(0.32) | — | — | — |
| 2.0-2.5 | 19.57(0.75) | 12.15(0.55) | 7.05(0.42) | — | — | — |
| >2.5 | 24.60(1.30) | 16.98(1.06) | 12.75(1.01) | — | — | — |
| N-terminal missing strings | ||||||
| any | — | — | — | 7.94(0.22) | 14.03(0.30) | 23.67(0.40) |
| 0.0-1.5 | — | — | — | 4.41(0.36) | 11.73(0.63) | 22.93(0.93) |
| 1.5-2.0 | — | — | — | 6.38(0.26) | 12.74(0.40) | 23.25(0.58) |
| 2.0-2.5 | — | — | — | 9.86(0.47) | 15.34(0.59) | 23.69(0.75) |
| >2.5 | — | — | — | 14.98(1.06) | 19.58(1.20) | 26.51(1.36) |
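The percentages of Table 5 can be illustrated with a short Python sketch. It assumes that the reference set consists of the normalized Calpha B-factors of all residues with coordinates in the same structure, the flanking residue included; the function name `percent_with_higher_bn` is hypothetical.

```python
from typing import Sequence


def percent_with_higher_bn(bn_values: Sequence[float], index: int) -> float:
    """Percentage of residues whose BN exceeds that of the residue at `index`.

    bn_values: normalized Calpha B-factors of all observed residues in a structure.
    index: position in bn_values of the residue flanking the missing string.
    """
    reference = bn_values[index]
    higher = sum(1 for value in bn_values if value > reference)
    return 100.0 * higher / len(bn_values)
```

A small value means that the flanking residue is among the most mobile residues of the structure.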
Given that the BN values monitor the local flexibility of the protein atoms, these observations suggest that the flexibility of the polypeptide chain increases as one approaches the missing strings. This is not surprising and supports the commonly accepted opinion that it is the excessive conformational disorder that hampers the localization of the missing strings.
Interestingly, the same BN trends are observed for missing strings of different length (data not shown), suggesting that their flexibility and conformational disorder are comparable and independent of their length.
Secondary structures around missing strings
We further investigated the backbone conformation of the residues that flank missing strings (see Table 6). We adopted a three-state classification of the secondary structures: any type of helix (label H = helix), any type of extended conformation (label E = extended), and all the rest (label L = loop). In the large majority of cases, the first residue before and the first residue after the missing string adopt a loop conformation. The second and the third residues that precede the missing string also tend to adopt a loop conformation, though this preference becomes less marked as one moves away from the missing string. Analogously, the preference for a loop conformation of the residues that follow the missing strings decreases as one moves away from the missing string.
Table 6.
Secondary structures (percentages) of the residues that precede or follow missing strings.
| Sec. str. type | Position −3 | Position −2 | Position −1 | Position +1 | Position +2 | Position +3 |
|---|---|---|---|---|---|---|
| Internal missing strings | ||||||
| H: | 23 | 15 | 8 | 7 | 15 | 22 |
| E: | 27 | 16 | 2 | 1 | 14 | 26 |
| L: | 50 | 69 | 90 | 92 | 71 | 52 |
| N-terminal missing strings | ||||||
| H: | — | — | — | 6 | 13 | 20 |
| E: | — | — | — | 0 | 13 | 23 |
| L: | — | — | — | 94 | 74 | 57 |
| C-terminal missing strings | ||||||
| H: | 39 | 27 | 0 | — | — | — |
| E: | 19 | 12 | 0 | — | — | — |
| L: | 42 | 61 | 100 | — | — | — |
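The three-state reduction described above can be sketched in Python. The one-letter codes are those printed by Stride (H, G, I for helices, E for extended strands, B or b for isolated bridges, T for turns, C for coil). How isolated bridges were classified is not stated in the text, so this example assigns them to the loop state as an assumption; the function names are hypothetical.

```python
from typing import Dict, Iterable

HELIX_CODES = {"H", "G", "I"}   # alpha, 3-10 and pi helices
EXTENDED_CODES = {"E"}          # assumption: isolated bridges (B, b) are not counted here


def three_state(code: str) -> str:
    """Map a one-letter secondary structure code to H (helix), E (extended) or L (loop)."""
    if code in HELIX_CODES:
        return "H"
    if code in EXTENDED_CODES:
        return "E"
    return "L"


def state_percentages(codes: Iterable[str]) -> Dict[str, float]:
    """Percentage of H, E and L states among the given residues."""
    states = [three_state(code) for code in codes]
    return {s: 100.0 * states.count(s) / len(states) for s in ("H", "E", "L")}
```

Collecting the codes of all residues at a given position, for example position −1, and passing them to `state_percentages` yields one column of a table such as Table 6.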
We do not give here separate data for missing strings of different length and for crystal structures at different resolution, since the impact of both string length and resolution is insignificant.
These data support the hypothesis that missing strings tend to be located in loops, which are more flexible and can be conformationally disordered. They agree with the hypothesis that the reason for the occurrence of missing strings is their considerable conformational disorder.
Solvent accessibility around missing strings
Table 7 shows the relative solvent accessible surface areas (SASA) of the 3 residues just before or after missing strings. Relative SASA values are computed by fixing at 100 the maximal SASA of the residue. Therefore, relative SASA values are expected to range from 0, when a residue is completely buried in the protein interior, to 100, when a residue is completely exposed to the solvent. However, a residue may have relative SASA values slightly higher than 100, since the reference value of the maximal SASA is defined empirically and arbitrarily. Here, for this reference value we selected the SASA value observed in the extended tripeptide Ala-X-Ala.14
Table 7.
Average relative SASA values of the residues that precede or follow missing strings (standard errors in parentheses).
| String | Position −3 | Position −2 | Position −1 | Position +1 | Position +2 | Position +3 |
|---|---|---|---|---|---|---|
| Internal | 40.7(0.5) | 48.9(0.5) | 82.8(0.6) | 83.2(0.6) | 47.7(0.5) | 39.9(0.5) |
| N-terminal | — | — | — | 93.3(0.5) | 51.9(0.5) | 44.1(0.5) |
| C-terminal | 46.8(0.6) | 54.3(0.5) | 88.2(0.6) | — | — | — |
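The relative SASA scale can be made concrete with a one-line computation, sketched here in Python. The function name and the numerical values in the usage note are invented for illustration and are not NACCESS reference values.

```python
def relative_sasa(sasa: float, reference_sasa: float) -> float:
    """Relative SASA on a scale where the reference (maximal) SASA equals 100.

    sasa: solvent accessible surface area of the residue in the protein.
    reference_sasa: SASA of the same residue type in the reference tripeptide.
    """
    if reference_sasa <= 0:
        raise ValueError("the reference SASA must be positive")
    return 100.0 * sasa / reference_sasa
```

For instance, `relative_sasa(54.0, 108.0)` gives 50.0, while a residue more exposed than in the reference tripeptide, as in `relative_sasa(120.0, 100.0)`, gives 120.0, which shows how values above 100 can arise.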
It appears from Table 7 that the first residues before and after the missing string have a very high relative SASA, often close to 100. We remark that this high value is due, at least in part, to the absence of the residues of the missing string: the 2 residues that flank the missing string therefore mimic an N- and a C-terminal residue and are considered terminal residues by the program that computes the relative SASA values. However, large relative SASA values are observed, on average, also for the second and third residues that precede or follow the missing string.
The high solvent accessibility of the 3 residues flanking the missing strings suggests that missing strings tend to be largely solvent exposed. This agrees with the observations reported above that missing strings tend to be hydrophilic, flexible, and conformationally disordered.
We do not provide separate data for missing strings of different length or for crystal structures refined at different resolution, since the impact of both resolution and string length is negligible.
Distance between missing strings and crystal packing contacts
The degree of solvent accessibility, examined in the previous section, is based on the analysis of the protein, independently of its neighbors that form crystal packing contacts in the crystal used to determine its 3-dimensional structure. Therefore, it is possible that a solvent exposed residue is involved in a crystal packing interaction and is not, as a consequence, really exposed to the solvent.
For this reason we examined the relationship between missing strings and crystal packing contacts. We computed the following quantities: the minimal Euclidean distance between the Calpha atom of the residue that precedes the missing string and the Calpha atom of any residue that is involved in a crystal packing contact, whether it belongs to the same molecule or to a symmetry-related molecule (D3_before); the same minimal Euclidean distance computed for the residue that follows the missing string (D3_after); the minimal sequence separation between the residue that precedes the missing string and a residue that is involved in a crystal packing contact and belongs to the same protein chain (D1_before); and the same minimal sequence separation computed for the residue that follows the missing string (D1_after). Table 8 shows the average values of these quantities, together with their standard errors.
Table 8.
Average Euclidean distances between the residues just before (D3_before) and after (D3_after) a missing string and a residue involved in a crystal packing contact and average sequence distance between the residues just before (D1_before) and after (D1_after) a missing string and a residue involved in a crystal packing contact (standard errors in parentheses).
| Resolution (Å) | D3_before | D3_after | D1_before | D1_after |
|---|---|---|---|---|
| Internal missing strings | ||||
| any | 17.8(0.7) | 21.7(0.9) | 14.1(0.7) | 14.5(0.5) |
| 0.0-1.5 | 31.7(4.0) | 34.7(4.8) | 11.0(3.1) | 13.2(1.0) |
| 1.5-2.0 | 15.4(1.1) | 14.7(0.9) | 10.1(0.4) | 10.2(0.4) |
| 2.0-2.5 | 16.3(1.0) | 16.1(0.9) | 14.6(0.8) | 15.5(0.7) |
| >2.5 | 17.5(1.4) | 22.3(1.8) | 22.2(2.5) | 20.1(1.6) |
| N-terminal missing strings | ||||
| any | — | 23.7(1.1) | — | 9.2(0.3) |
| 0.0-1.5 | — | 26.4(2.7) | — | 7.2(0.5) |
| 1.5-2.0 | — | 24.5(1.6) | — | 8.1(0.4) |
| 2.0-2.5 | — | 23.7(2.0) | — | 11.0(0.6) |
| >2.5 | — | 19.6(2.6) | — | 15.2(1.7) |
| C-terminal missing strings | ||||
| any | 17.7(1.0) | — | 7.4(0.4) | — |
| 0.0-1.5 | 22.6(2.8) | — | 5.0(0.3) | — |
| 1.5-2.0 | 16.3(1.4) | — | 6.6(0.6) | — |
| 2.0-2.5 | 13.5(1.2) | — | 8.4(0.5) | — |
| >2.5 | 22.9(3.4) | — | 10.0(0.9) | — |
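The two kinds of distance can be sketched in Python as follows. The function names and the input representation (Calpha coordinates as 3-tuples in Å, residues identified by their index along the chain) are assumptions made for this example; the identification of the residues involved in crystal packing contacts is taken as given.

```python
import math
from typing import Iterable, Tuple

Coord = Tuple[float, float, float]


def min_euclidean_distance(flank_ca: Coord, contact_cas: Iterable[Coord]) -> float:
    """D3: shortest Calpha-Calpha distance between a flanking residue and any
    residue involved in a crystal packing contact (same or symmetry-related molecule)."""
    return min(math.dist(flank_ca, ca) for ca in contact_cas)


def min_sequence_separation(flank_index: int, contact_indices: Iterable[int]) -> int:
    """D1: smallest separation along the sequence between a flanking residue and
    any residue of the same chain involved in a crystal packing contact."""
    return min(abs(flank_index - j) for j in contact_indices)
```

Both functions raise `ValueError` when no contact residue is supplied, since the minimum is then undefined. Whether the flanking residue itself may count as a contact residue (distance 0) is not specified in the text and is left to the caller.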
On average, missing strings are far from crystal packing contacts and the Euclidean distances (D3_before and D3_after) between internal missing strings and crystal packing contacts increase with improving resolution. It is then reasonable to suppose that crystals that diffract to higher resolution display conformational disorder far from the solid state intermolecular contacts. This trend is not observed for N-terminal and C-terminal missing strings. However, this might reflect their role in crystallogenesis.15
Surprisingly, we observed an opposite trend in the case of the sequence distance between missing strings and residues of the same polypeptide chain that are involved in crystal packing interactions. These sequence separations tend to be slightly larger in crystal structures at low resolution. However, this might reflect the fact (not shown here) that the structures of larger proteins are often refined at lower resolution than the structures of smaller proteins.
Conclusions
This survey has shown that most of the protein crystal structures of the Protein Data Bank are incomplete, since some residues remain elusive, even at very high resolution. This must be considered with extreme attention, especially by the end-users of the crystal structures, for example structural bioinformaticians and molecular biologists, since an incomplete structure may provide erroneous information, for example when determining statistical trends or electrostatic potentials.
Polar residues tend to occur in missing strings together with glycine and proline, while apolar and aromatic residues tend to avoid them. Particularly flexible residues, as shown by their high B-factors, by their exposure to the solvent and by their secondary structures, flank the missing strings.
The data reported here should prove useful for crystallographers when interpreting electron density maps. It is in fact rather arbitrary to decide if a residue can be traced or not, especially at low resolution. One option is to model residues, even if they are invisible, and allow their B-factors to rise to astronomical levels. The other is to ignore the invisible residues and eliminate them completely from the file deposited in the Protein Data Bank. The information that we have summarized here may be used as a semi-quantitative guideline for macromolecular crystallographers to decide when a residue is really invisible.
Materials and Methods
We downloaded a subset of the crystal structures deposited in the Protein Data Bank,1,2 disregarding structures containing nucleic acids and structures refined at a resolution worse than 3.0 Å. We retained only monomeric proteins containing one molecule in the asymmetric unit, in order to exclude structures where inter-molecular interactions different from crystal packing interactions might influence conformational disorder. We reduced redundancy drastically with a threshold of pairwise sequence identity of 30%.
Residues that are deposited in the PDB with zero occupancy are listed in the PDB files in the lines that begin with the label REMARK 475, while residues whose position was not determined are listed in the lines that begin with the label REMARK 465. Therefore, we scanned the PDB files in search of these lines.
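A minimal sketch of such a scan is given below. It is not the code used in this work. It assumes that each data line of REMARK 465 or REMARK 475 carries an optional model number, a residue name, a one-character chain identifier and a sequence number with an optional insertion code, and it skips the free-text header lines of the remark. The function names `parse_remarks` and `group_strings` are hypothetical, and treating a run of consecutive sequence numbers in the same chain as one missing string is an assumption made here for illustration.

```python
import re

# Data line: REMARK 465|475, optional model number, residue name,
# one-character chain identifier, sequence number, optional insertion code.
_DATA_LINE = re.compile(
    r"^REMARK (465|475)\s+(?:(\d+)\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Za-z]?)\s*$"
)


def parse_remarks(lines):
    """Collect (chain, sequence number, residue name) tuples per remark type."""
    found = {"465": [], "475": []}
    for line in lines:
        match = _DATA_LINE.match(line.rstrip("\n"))
        if match:  # header and free-text lines do not match and are skipped
            remark, _model, resname, chain, seq, _icode = match.groups()
            found[remark].append((chain, int(seq), resname))
    return found


def group_strings(residues):
    """Group residues into runs of consecutive sequence numbers per chain."""
    strings = []
    for chain, seq, resname in sorted(residues):
        last = strings[-1][-1] if strings else None
        if last is not None and last[0] == chain and seq - last[1] == 1:
            strings[-1].append((chain, seq, resname))
        else:
            strings.append([(chain, seq, resname)])
    return strings


# Made-up header excerpt, for illustration only
sample = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     GLY A     2",
    "REMARK 465     SER A    57",
    "REMARK 475   M RES C SSEQI",
    "REMARK 475     LYS A   120",
]
remarks = parse_remarks(sample)
missing_strings = group_strings(remarks["465"])
```

In this made-up excerpt, residues 1 and 2 of chain A form one missing string and residue 57 forms another. Insertion codes are ignored in the grouping, so residues that share a sequence number would need extra handling.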
We used Stride to assign secondary structures to the residues16 and NACCESS to measure the solvent accessible surface areas of the residues.17 We used the program CPC to identify the residues involved in crystal packing interactions.18
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
KDC and OC conceived and designed the work. OC performed the computations. KDC and OC wrote the manuscript. We thank our colleagues at the Department of Structural and Computational Biology for helpful comments.
Ethics Declaration
The procedures described in the present paper do not need to be approved by an Ethics Committee of Human Experimentation.
References
- 1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28:235-42; PMID:10592235; http://dx.doi.org/10.1093/nar/28.1.235
- 2. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr., Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977; 112:535-42; PMID:875032; http://dx.doi.org/10.1016/S0022-2836(77)80200-3
- 3. Carugo O, Pongor S. The evolution of structural databases. Trends Biotechnol 2002; 20:498-501; PMID:12443870; http://dx.doi.org/10.1016/S0167-7799(02)02082-6
- 4. Dodson EJ, Davies GJ, Lamzin VS, Murshudov GN, Wilson KS. Validation tools: can they indicate the information content of macromolecular crystal structures? Structure 1998; 6:685-90; PMID:9655828; http://dx.doi.org/10.1016/S0969-2126(98)00070-7
- 5. Le Gall T, Romero PR, Cortese MS, Uversky VN, Dunker AK. Intrinsic disorder in the Protein Data Bank. J Biomol Struct Dyn 2007; 24:325-41; PMID:17206849; http://dx.doi.org/10.1080/07391102.2007.10507123
- 6. Mohan A, Uversky VN, Radivojac P. Influence of sequence changes and environment on intrinsically disordered proteins. PLoS Comput Biol 2009; 5:e1000497; PMID:19730682; http://dx.doi.org/10.1371/journal.pcbi.1000497
- 7. Tramontano A. Bioinformatica. Rome: Zanichelli, 2002.
- 8. Radivojac P, Obradovic Z, Smith DK, Zhu G, Vucetic S, Brown CJ, Lawson JD, Dunker AK. Protein flexibility and intrinsic disorder. Protein Sci 2004; 13:71-80; PMID:14691223; http://dx.doi.org/10.1110/ps.03128904
- 9. Campen A, Williams RM, Brown CJ, Meng J, Uversky VN, Dunker AK. TOP-IDP-Scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein Pept Lett 2008; 15:956-63; PMID:18991772; http://dx.doi.org/10.2174/092986608785849164
- 10. Petsko GA, Ringe D. Fluctuations in protein structure from X-ray diffraction. Annu Rev Biophys Bioeng 1984; 13:331-71; PMID:6331286; http://dx.doi.org/10.1146/annurev.bb.13.060184.001555
- 11. Carugo O. Correlation between occupancy and B factor of water molecules in protein crystal structures. Protein Eng 1999; 12:1021-4; PMID:10611392; http://dx.doi.org/10.1093/protein/12.12.1021
- 12. Carugo O, Argos P. Correlation between side chain mobility and conformation in protein structures. Protein Eng 1997; 10:777-87; PMID:9342144; http://dx.doi.org/10.1093/protein/10.7.777
- 13. Carugo O, Argos P. Reliability of atomic displacement parameters in protein crystal structures. Acta Crystallogr D Biol Crystallogr 1999; 55 (Pt 2):473-8; PMID:10089358; http://dx.doi.org/10.1107/S0907444998011688
- 14. Hubbard SJ, Campbell SF, Thornton JM. Molecular recognition. Conformational analysis of limited proteolytic sites and serine proteinase protein inhibitors. J Mol Biol 1991; 220:507-30; PMID:1856871; http://dx.doi.org/10.1016/0022-2836(91)90027-4
- 15. Carugo O. Participation of protein sequence termini in crystal contacts. Protein Sci 2011; 20:2121-4; PMID:21739502; http://dx.doi.org/10.1002/pro.690
- 16. Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins 1995; 23:566-79; PMID:8749853; http://dx.doi.org/10.1002/prot.340230412
- 17. Hubbard SJ, Thornton JM. NACCESS, Department of Biochemistry and Molecular Biology, University College London. 1993.
- 18. Carugo O, Djinovic-Carugo K. How many packing contacts are observed in protein crystals? J Struct Biol 2012; 180:96-100; PMID:22634724; http://dx.doi.org/10.1016/j.jsb.2012.05.009
- 19. Carugo O. Prediction of polypeptide fragments exposed to the solvent. In Silico Biology 2003; 3:0035.
