Table 1.
Cross-validation of various input k-mer sequences
| Model | Test set | S5F (5-mer) | DeepSHM (5-mer) | DeepSHM (9-mer) | DeepSHM (15-mer) | DeepSHM (21-mer) |
|---|---|---|---|---|---|---|
| Substitution rate | IGHV1 | 0.52 | 0.57 | 0.57 | 0.56 | 0.56 |
| | IGHV3 | 0.51 | 0.54 | 0.55 | 0.55 | 0.54 |
| | IGHV4 | 0.54 | 0.57 | 0.57 | 0.58 | 0.57 |
| | IGHV2, 5, 6, 7 | 0.53 | 0.55 | 0.56 | 0.56 | 0.55 |
| | Avg correlation | 0.52 | 0.56 | 0.56 | 0.56 | 0.55 |
| | Best - S5F | NA | 0.04 | 0.04 | 0.04 | 0.03 |
| | Mean - S5F | NA | 0.02 | 0.03 | 0.01 | −0.01 |
| | p-value | NA | 3.52E-15 | 3.46E-17 | 3.78E-9 | 0.52 |
| Mutation frequency | IGHV1 | 0.69 | 0.74 | 0.79 | 0.82 | 0.82 |
| | IGHV3 | 0.68 | 0.74 | 0.79 | 0.80 | 0.79 |
| | IGHV4 | 0.69 | 0.74 | 0.79 | 0.84 | 0.84 |
| | IGHV2, 5, 6, 7 | 0.70 | 0.69 | 0.76 | 0.78 | 0.77 |
| | Avg correlation | 0.69 | 0.73 | 0.78 | 0.81 | 0.80 |
| | Best - S5F | NA | 0.04 | 0.09 | 0.12 | 0.11 |
| | Mean - S5F | NA | 0.03 | 0.07 | 0.09 | 0.09 |
| | p-value | NA | 2.76E-15 | 2.31E-17 | 2.31E-17 | 2.31E-17 |
| Weighted substitution (substitution rate) | IGHV1 | 0.52 | 0.55 | 0.53 | 0.55 | 0.53 |
| | IGHV3 | 0.51 | 0.52 | 0.53 | 0.52 | 0.51 |
| | IGHV4 | 0.54 | 0.55 | 0.54 | 0.57 | 0.54 |
| | IGHV2, 5, 6, 7 | 0.53 | 0.53 | 0.54 | 0.55 | 0.53 |
| | Avg correlation | 0.52 | 0.54 | 0.54 | 0.55 | 0.53 |
| | Best - S5F | NA | 0.02 | 0.02 | 0.03 | 0.01 |
| | Mean - S5F | NA | −0.06 | −0.08 | −0.11 | −0.17 |
| | p-value | NA | 4.19E-14 | 2.91E-16 | 6.82E-17 | 2.31E-17 |
| Weighted substitution (mutation frequency) | IGHV1 | 0.69 | 0.74 | 0.78 | 0.80 | 0.81 |
| | IGHV3 | 0.68 | 0.73 | 0.78 | 0.80 | 0.78 |
| | IGHV4 | 0.69 | 0.74 | 0.78 | 0.80 | 0.82 |
| | IGHV2, 5, 6, 7 | 0.70 | 0.70 | 0.75 | 0.77 | 0.78 |
| | Avg correlation | 0.69 | 0.73 | 0.77 | 0.79 | 0.80 |
| | Best - S5F | NA | 0.04 | 0.08 | 0.10 | 0.11 |
| | Mean - S5F | NA | 0 | 0.04 | 0.04 | 0 |
| | p-value | NA | 0.001 | 2.33E-8 | 1.49E-7 | 0.25 |
Correlations of models trained repeatedly with different random seeds (but identical hyperparameters) had small standard deviations, below 0.01 in all cases. p-values are from Wilcoxon signed-rank tests comparing the training results for each model with the accuracy of the corresponding S5F model, and were corrected for multiple comparisons using the Benjamini-Hochberg procedure.
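
The multiple-comparison correction mentioned in the note can be illustrated with a minimal pure-Python sketch of the standard Benjamini-Hochberg adjustment. This is not the authors' code: the function name `benjamini_hochberg` and the raw p-values in `raw` are hypothetical, chosen only for illustration, and the Wilcoxon signed-rank step that would produce the raw p-values is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest raw value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest. Each raw value is
    # scaled by m / rank, and the running minimum keeps the adjusted values
    # monotone in the raw values (and capped at 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, NOT taken from Table 1.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

With this procedure an adjusted p-value is never smaller than its raw value, so a comparison that is not significant before correction (such as the 0.52 and 0.25 entries in the table) cannot become significant afterwards.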