Table 3. Distinct input variable sets used for the machine learning analyses, and learning algorithm types.
A. Distinct input variable sets used for the Super Learning analyses
Variable set name | Variables¹ |
1: geog | Geographic region (Asia/Europe and Americas/N. Africa/S. Africa) |
2: geog.AAchVRC01 | geog + Group 1² (AAs in the VRC01 binding footprint) (97, 124, 198, 276, 278, 279, 280, 281, 282, 365, 371, 428, 429, 430, 455, 456, 458, 459, 460, 461, 463, 465, 466, 467, 474, 476) |
3: geog.AAchCD4bs | geog + Group 2 (AAs in the CD4 binding site) (124, 125, 198, 279, 280, 281, 282, 283, 365, 369, 374, 425, 426, 428, 429, 430, 432, 455, 456, 458, 459, 460, 461, 471, 474, 475, 476, 477) |
4: geog.AAchESA | geog + Group 3 (AAs with sufficient Exposed Surface Area) (97, 198, 276, 278, 279, 280, 281, 282, 365, 371, 415, 428, 429, 430, 455, 458, 459, 460, 461, 467, 474, 476) |
5: geog.AAchGLYCO | geog + Group 4 (AAs important for glycosylation) (61, 197, 276, 362, 363, 386, 392, 462, 463) |
6: geog.AAchCOVAR | geog + Group 5 (AAs that covary with the VRC01 binding footprint) (46, 132, 138, 144, 150, 179, 181, 186, 190, 290, 321, 328, 354, 389, 394, 396, 397, 406) |
7: geog.AAchPNGS | geog + Group 6 (AAs associated with VRC01-specific PNGS effects) (130, 139, 143, 156, 187, 197, 241, 289, 339, 355, 363, 406, 408, 410, 442, 448, 460, 462) |
8: geog.AAchgp41 | geog + All gp41 sites that affect global neutralization sensitivity (544, 569, 582, 589, 655, 668, 675, 677, 680, 681, 683, 688, 702) |
9: geog.AAchGlyGP160 | geog + All gp160 N-glycosylation sites that are not included in VRC01 contact sites or paratope or sites with covariability |
10: geog.st | geog + Group 8 (viral subtypes) (01_AE/02_AG/07_BC/A1/A1C/A1D/B/C/D/O/Other) |
11: geog.sequonCt | geog + Group 9 (region-specific PNGS counts) |
12: geog.geom | geog + Group 10 (viral geometry metrics) |
13: geog.cys | geog + Group 11 (counts of cysteine residues in certain regions) |
14: geog.sbulk | geog + Group 12 (steric bulk at critical locations) |
15: geog.corP | geog + features selected by univariate t-test p-values |
16: geog.glmnet | geog + features with non-zero coefficients in a lasso fit |
17: geog.all.MCCV | All variables in sets 1–13, described above (AAs at positions 46, 61, 97, 124, 125, 130, 132, 138, 139, 143, 144, 150, 156, 179, 181, 186, 187, 190, 197, 198, 241, 276, 278, 279, 280, 281, 282, 283, 289, 290, 321, 328, 339, 354, 355, 362, 363, 365, 369, 371, 374, 386, 389, 392, 394, 396, 397, 406, 408, 410, 415, 425, 426, 428, 429, 430, 432, 442, 448, 455, 456, 458, 459, 460, 461, 462, 463, 465, 466, 467, 471, 474, 475, 476, and 477, plus all features in Groups 8 through 12) |
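Each AA-based variable set in panel A is the geography variable plus the residues at a list of HXB2 positions, so the position list given for set 17 can be cross-checked against the per-set lists. The sketch below is in Python for illustration only (the analyses themselves used R); the dictionary keys and the helper name `union_of_groups` are hypothetical, and the positions are transcribed from sets 2 through 7 and set 17 as printed above.

```python
# HXB2 positions transcribed from panel A, sets 2-7.
# The dictionary keys are shorthand for this sketch, not identifiers from the paper.
GROUP_POSITIONS = {
    "AAchVRC01": [97, 124, 198, 276, 278, 279, 280, 281, 282, 365, 371, 428,
                  429, 430, 455, 456, 458, 459, 460, 461, 463, 465, 466, 467,
                  474, 476],
    "AAchCD4bs": [124, 125, 198, 279, 280, 281, 282, 283, 365, 369, 374, 425,
                  426, 428, 429, 430, 432, 455, 456, 458, 459, 460, 461, 471,
                  474, 475, 476, 477],
    "AAchESA": [97, 198, 276, 278, 279, 280, 281, 282, 365, 371, 415, 428,
                429, 430, 455, 458, 459, 460, 461, 467, 474, 476],
    "AAchGLYCO": [61, 197, 276, 362, 363, 386, 392, 462, 463],
    "AAchCOVAR": [46, 132, 138, 144, 150, 179, 181, 186, 190, 290, 321, 328,
                  354, 389, 394, 396, 397, 406],
    "AAchPNGS": [130, 139, 143, 156, 187, 197, 241, 289, 339, 355, 363, 406,
                 408, 410, 442, 448, 460, 462],
}

# HXB2 positions listed for set 17 (geog.all.MCCV) in panel A.
ALL_MCCV_POSITIONS = [
    46, 61, 97, 124, 125, 130, 132, 138, 139, 143, 144, 150, 156, 179, 181,
    186, 187, 190, 197, 198, 241, 276, 278, 279, 280, 281, 282, 283, 289, 290,
    321, 328, 339, 354, 355, 362, 363, 365, 369, 371, 374, 386, 389, 392, 394,
    396, 397, 406, 408, 410, 415, 425, 426, 428, 429, 430, 432, 442, 448, 455,
    456, 458, 459, 460, 461, 462, 463, 465, 466, 467, 471, 474, 475, 476, 477,
]


def union_of_groups(groups):
    """Return the sorted union of all position lists in `groups`."""
    merged = set()
    for positions in groups.values():
        merged.update(positions)
    return sorted(merged)


union_positions = union_of_groups(GROUP_POSITIONS)
matches_set_17 = union_positions == sorted(ALL_MCCV_POSITIONS)
print(len(union_positions), matches_set_17)
```

As transcribed here, the union of sets 2 through 7 covers 75 positions and coincides with the list printed for set 17. The gp41 positions of set 8 do not appear in that printed list.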
B. Learning algorithm types and the distinct input variable sets used with each learner
Algorithm type³ | Input variable sets from panel A
SL.randomForest | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
SL.glmnet | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
SL.xgboost | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
SL.naiveBayes | 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16 |
SL.glm | 1,10,11,12,13,15,16 |
SL.step | 1,10,11,12,13,15,16 |
SL.step.interaction | 1,10,11,12,13,15,16 |
SL.mean | None |
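One way to read panel B is as a list of (algorithm, variable set) pairs, with SL.mean taking no input variables. The sketch below enumerates those pairs in Python for illustration; it only counts the rows of the table and does not reproduce how the authors assembled their SuperLearner library, and the names `FULL_SETS`, `REDUCED_SETS`, and `build_library` are hypothetical.

```python
# Variable-set numbers copied from panel B.
FULL_SETS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16]
REDUCED_SETS = [1, 10, 11, 12, 13, 15, 16]

LEARNER_SETS = {
    "SL.randomForest": FULL_SETS,
    "SL.glmnet": FULL_SETS,
    "SL.xgboost": FULL_SETS,
    "SL.naiveBayes": FULL_SETS,
    "SL.glm": REDUCED_SETS,
    "SL.step": REDUCED_SETS,
    "SL.step.interaction": REDUCED_SETS,
    "SL.mean": [],  # listed as "None": uses no input variables
}


def build_library(learner_sets):
    """Expand the learner -> variable-set mapping into (learner, set) pairs.

    A learner with no variable sets contributes a single (learner, None) pair.
    """
    library = []
    for learner, sets in learner_sets.items():
        if not sets:
            library.append((learner, None))
            continue
        for set_number in sets:
            library.append((learner, set_number))
    return library


library = build_library(LEARNER_SETS)
print(len(library))
```

Under this reading the table yields 4 × 15 + 3 × 7 + 1 = 82 pairs. The subtype set (10) is paired only with the three regression-type learners, and set 17 is not paired with any learner in panel B.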
geog = geography; AA = amino acid. AA positions are given in HXB2 coordinates.
¹ All amino acid sites included in the variable sets passed a minimum variability filter: the residue at the site had to differ from the consensus residue in at least 3 sequences in the entire CATNAP data set (i.e., before splitting into the two analysis sets).
² See Methods for details on the listed input variable Groups 1–13.
³ The algorithms are listed by the names of the functions used in the SuperLearner R package. An exception is “SL.naiveBayes”, a custom wrapper around the naiveBayes function from the e1071 package. SL.glmnet was used with the lasso penalty. All tuning parameters were set to the SuperLearner package defaults, except for SL.xgboost, which we modified to fit decision stumps rather than deeper trees.
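The minimum variability filter in the first footnote can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' code: the consensus residue is taken to be the most frequent residue at each alignment column, the function name `variable_sites` and the toy sequences are hypothetical, and the returned indices are 0-based alignment columns rather than HXB2 coordinates.

```python
from collections import Counter

MIN_DIFFERING = 3  # threshold stated in the footnote


def variable_sites(aligned_seqs, min_differing=MIN_DIFFERING):
    """Return 0-based column indices where at least `min_differing`
    sequences differ from the column's most frequent residue.

    Assumption: "consensus" means the most frequent residue in the column.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    kept = []
    for col in range(length):
        residues = [seq[col] for seq in aligned_seqs]
        _, consensus_count = Counter(residues).most_common(1)[0]
        n_differing = len(residues) - consensus_count
        if n_differing >= min_differing:
            kept.append(col)
    return kept


# Toy alignment (hypothetical data): only the last column varies enough.
toy_alignment = ["NKT", "NKT", "NRS", "DKS", "NKS", "NKT"]
print(variable_sites(toy_alignment))
```

In the toy alignment the first two columns each have a single sequence that differs from the consensus, so they are dropped. The third column has three differing sequences and is kept.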