Total number of training and testing examples for chemical shift prediction for each atom type. The training set combines the SPARTA+ training set with the training and testing sets for SHIFTX+, with all redundant chains removed. We developed a new test set of 200 high-resolution proteins with chemical shifts available from RefDB; duplicate chains and residues with no deposited chemical shift values are excluded from the test data. The LH-Test set is the subset of test proteins with only low sequence homology to other proteins, such that sequence or structural homology cannot be exploited. We also created two curated test sets, which additionally exclude paramagnetic proteins, some Hα chemical shifts with a calculated ring current effect exceeding 1.5 ppm, and “outliers” detected by the PANAV program (13). Further information is provided in Methods and ESI.
Dataset | # of PDBs | H | Hα | C′ | Cα | Cβ | N |
---|---|---|---|---|---|---|---|
Train | 647 | 72 894 | 56 149 | 58 228 | 79 611 | 70 621 | 74 896 |
Test | 200 | 19 120 | 11 727 | 8231 | 13 140 | 10 139 | 15 374 |
Test (curated) | 200 | 18 494 | 11 240 | 7861 | 12 533 | 9883 | 14 610 |
LH-Test | 100 | 8634 | 4979 | 3332 | 5685 | 4278 | 6576 |
LH-Test (curated) | 100 | 8606 | 4950 | 3331 | 5251 | 4201 | 6480 |
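
The effect of curation on each test set can be read off the table by subtracting the curated counts from the uncurated ones. The following is a minimal sketch of that arithmetic using only the counts tabulated above; the names `ATOM_TYPES`, `COUNTS`, and `removed_by_curation` are illustrative and not part of the original work.

```python
# Atom types in the column order of the table.
ATOM_TYPES = ["H", "Hα", "C′", "Cα", "Cβ", "N"]

# Chemical shift counts per atom type, copied from the table rows.
COUNTS = {
    "Test": [19120, 11727, 8231, 13140, 10139, 15374],
    "Test (curated)": [18494, 11240, 7861, 12533, 9883, 14610],
    "LH-Test": [8634, 4979, 3332, 5685, 4278, 6576],
    "LH-Test (curated)": [8606, 4950, 3331, 5251, 4201, 6480],
}


def removed_by_curation(full, curated):
    """Return {atom_type: (n_removed, percent_removed)} for one test set."""
    result = {}
    for atom, n_full, n_cur in zip(ATOM_TYPES, COUNTS[full], COUNTS[curated]):
        n_removed = n_full - n_cur
        result[atom] = (n_removed, 100.0 * n_removed / n_full)
    return result


if __name__ == "__main__":
    pairs = [("Test", "Test (curated)"), ("LH-Test", "LH-Test (curated)")]
    for full, curated in pairs:
        print(full)
        for atom, (n, pct) in removed_by_curation(full, curated).items():
            print(f"  {atom}: {n} removed ({pct:.1f}%)")
```

For example, `removed_by_curation("Test", "Test (curated)")["H"]` gives 626 removed amide proton shifts (19 120 minus 18 494), together with the corresponding percentage of the uncurated count.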