2019 Apr 1;15(4):e1006952. doi: 10.1371/journal.pcbi.1006952

Table 3. Distinct input variable sets used for the machine learning analyses, and learning algorithm types.

A. Distinct input variable sets used for the Super Learning analyses
Variable set name Variables¹
1: geog Geographic region (Asia/Europe and Americas/N. Africa/S. Africa)
2: geog.AAchVRC01 geog + Group 1² (AAs in the VRC01 binding footprint) (97, 124, 198, 276, 278, 279, 280, 281, 282, 365, 371, 428, 429, 430, 455, 456, 458, 459, 460, 461, 463, 465, 466, 467, 474, 476)
3: geog.AAchCD4bs geog + Group 2 (AAs in the CD4 binding site) (124, 125, 198, 279, 280, 281, 282, 283, 365, 369, 374, 425, 426, 428, 429, 430, 432, 455, 456, 458, 459, 460, 461, 471, 474, 475, 476, 477)
4: geog.AAchESA geog + Group 3 (AAs with sufficient Exposed Surface Area) (97, 198, 276, 278, 279, 280, 281, 282, 365, 371, 415, 428, 429, 430, 455, 458, 459, 460, 461, 467, 474, 476)
5: geog.AAchGLYCO geog + Group 4 (AAs important for glycosylation) (61, 197, 276, 362, 363, 386, 392, 462, 463)
6: geog.AAchCOVAR geog + Group 5 (AAs that covary with the VRC01 binding footprint) (46, 132, 138, 144, 150, 179, 181, 186, 190, 290, 321, 328, 354, 389, 394, 396, 397, 406)
7: geog.AAchPNGS geog + Group 6 (AAs associated with VRC01-specific PNGS effects) (130, 139, 143, 156, 187, 197, 241, 289, 339, 355, 363, 406, 408, 410, 442, 448, 460, 462)
8: geog.AAchgp41 geog + All gp41 sites that affect global neutralization sensitivity (544, 569, 582, 589, 655, 668, 675, 677, 680, 681, 683, 688, 702)
9: geog.AAchGlyGP160 geog + All gp160 N-glycosylation sites that are not included in VRC01 contact sites or paratope or sites with covariability
10: geog.st geog + Group 8 (viral subtypes) (01_AE/02_AG/07_BC/A1/A1C/A1D/B/C/D/O/Other)
11: geog.sequonCt geog + Group 9 (region-specific PNGS counts)
12: geog.geom geog + Group 10 (viral geometry metrics)
13: geog.cys geog + Group 11 (counts of cysteine residues in certain regions)
14: geog.sbulk geog + Group 12 (steric bulk at critical locations)
15: geog.corP geog + features selected with t-test univariate p-values
16: geog.glmnet geog + features with non-zero coefficients in a lasso fit
17: geog.all.MCCV All variables in sets 1–13, described above (AAs at positions 46, 61, 97, 124, 125, 130, 132, 138, 139, 143, 144, 150, 156, 179, 181, 186, 187, 190, 197, 198, 241, 276, 278, 279, 280, 281, 282, 283, 289, 290, 321, 328, 339, 354, 355, 362, 363, 365, 369, 371, 374, 386, 389, 392, 394, 396, 397, 406, 408, 410, 415, 425, 426, 428, 429, 430, 432, 442, 448, 455, 456, 458, 459, 460, 461, 462, 463, 465, 466, 467, 471, 474, 475, 476, and 477, plus all features in Groups 8 through 12)
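
The position list given for variable set 17 can be cross-checked against the per-group lists in sets 2 through 7. The following is a minimal sketch of that check, not part of the original analysis; the position lists are transcribed from the rows above, and the names `GROUPS`, `SET17_POSITIONS`, and `union_of_groups` are hypothetical.

```python
# HXB2 positions transcribed from variable sets 2-7 (Groups 1-6) of part A.
GROUPS = {
    "Group 1 (VRC01 footprint)": [
        97, 124, 198, 276, 278, 279, 280, 281, 282, 365, 371, 428, 429,
        430, 455, 456, 458, 459, 460, 461, 463, 465, 466, 467, 474, 476],
    "Group 2 (CD4 binding site)": [
        124, 125, 198, 279, 280, 281, 282, 283, 365, 369, 374, 425, 426,
        428, 429, 430, 432, 455, 456, 458, 459, 460, 461, 471, 474, 475,
        476, 477],
    "Group 3 (exposed surface area)": [
        97, 198, 276, 278, 279, 280, 281, 282, 365, 371, 415, 428, 429,
        430, 455, 458, 459, 460, 461, 467, 474, 476],
    "Group 4 (glycosylation)": [61, 197, 276, 362, 363, 386, 392, 462, 463],
    "Group 5 (covarying sites)": [
        46, 132, 138, 144, 150, 179, 181, 186, 190, 290, 321, 328, 354,
        389, 394, 396, 397, 406],
    "Group 6 (PNGS effects)": [
        130, 139, 143, 156, 187, 197, 241, 289, 339, 355, 363, 406, 408,
        410, 442, 448, 460, 462],
}

# Positions transcribed from the row for variable set 17.
SET17_POSITIONS = [
    46, 61, 97, 124, 125, 130, 132, 138, 139, 143, 144, 150, 156, 179,
    181, 186, 187, 190, 197, 198, 241, 276, 278, 279, 280, 281, 282, 283,
    289, 290, 321, 328, 339, 354, 355, 362, 363, 365, 369, 371, 374, 386,
    389, 392, 394, 396, 397, 406, 408, 410, 415, 425, 426, 428, 429, 430,
    432, 442, 448, 455, 456, 458, 459, 460, 461, 462, 463, 465, 466, 467,
    471, 474, 475, 476, 477]


def union_of_groups(groups):
    """Return the sorted union of all positions across the given groups."""
    positions = set()
    for sites in groups.values():
        positions.update(sites)
    return sorted(positions)


union = union_of_groups(GROUPS)
print(len(union), union == SET17_POSITIONS)
```

With the values as transcribed here, the union of Groups 1 through 6 contains 75 positions and matches the set 17 list exactly.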
B. Learning algorithm types and the distinct input variable sets used with each learner
Algorithm type³ Input variable sets from 3A
SL.randomForest 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16
SL.glmnet 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16
SL.xgboost 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16
SL.naiveBayes 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16
SL.glm 1,10,11,12,13,15,16
SL.step 1,10,11,12,13,15,16
SL.step.interaction 1,10,11,12,13,15,16
SL.mean None
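
The pairing of learners with variable sets in part B can be written down as a small data structure. The sketch below is illustrative only: it enumerates the learner and variable-set combinations as tabulated and does not call the SuperLearner R package. The names `LEARNER_SETS`, `build_library`, and `unused_sets` are hypothetical, and treating "None" as a single learner entry with no inputs is an assumption.

```python
# Variable-set numbers transcribed from part B.
FULL = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16]
REDUCED = [1, 10, 11, 12, 13, 15, 16]

LEARNER_SETS = {
    "SL.randomForest": FULL,
    "SL.glmnet": FULL,
    "SL.xgboost": FULL,
    "SL.naiveBayes": FULL,
    "SL.glm": REDUCED,
    "SL.step": REDUCED,
    "SL.step.interaction": REDUCED,
    "SL.mean": [],  # "None" in the table: no input variables
}


def build_library(learner_sets):
    """Return (learner, variable_set) pairs; None marks a learner with no inputs."""
    library = []
    for learner, sets in learner_sets.items():
        if not sets:
            library.append((learner, None))
        for set_number in sets:
            library.append((learner, set_number))
    return library


def unused_sets(learner_sets, all_sets=range(1, 18)):
    """Return variable-set numbers from part A that no learner is paired with."""
    used = {s for sets in learner_sets.values() for s in sets}
    return sorted(set(all_sets) - used)


library = build_library(LEARNER_SETS)
print(len(library))               # number of learner/variable-set pairs
print(unused_sets(LEARNER_SETS))  # sets in part A not listed in part B
```

Under these assumptions the table yields 82 pairs (4 × 15 + 3 × 7 + 1), and variable set 17 is the only set in part A that is not paired with any learner in part B.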

geog = geography; AA = amino acid. AA positions are given in HXB2 coordinates.

¹ All amino acid sites included in the variable sets passed a minimum variability filter: the residue at the site had to differ from the consensus residue in at least 3 sequences of the entire CATNAP data set (i.e., before splitting into the two analysis sets).
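
A minimum variability filter of this kind can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that the consensus at a site is the most frequent residue in the data set, that sequences are pre-aligned strings of equal length, and that gap characters count as residues. The function name `variable_sites` and the toy sequences are hypothetical, and the returned indices are 0-based alignment columns rather than HXB2 positions.

```python
from collections import Counter


def variable_sites(sequences, min_differing=3):
    """Return 0-based columns where at least `min_differing` sequences
    differ from the consensus (most frequent) residue."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    kept = []
    for col in range(length):
        residues = [seq[col] for seq in sequences]
        # Count of the most frequent residue at this column.
        _, consensus_count = Counter(residues).most_common(1)[0]
        if len(residues) - consensus_count >= min_differing:
            kept.append(col)
    return kept


# Toy alignment (made-up data, for illustration only).
toy = ["ACDE", "ACDE", "ACDF", "AGDF", "AGHF", "ATDF"]
print(variable_sites(toy))
```

In the toy alignment only the second column passes: three of the six sequences differ from its consensus residue, while the other columns have zero, one, and two differing sequences.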

² See Methods for details on the listed input variable Groups 1–13.

³ The algorithms are listed by the names of the functions used in the SuperLearner R package. An exception is “SL.naiveBayes”, a custom wrapper designed to use the naiveBayes function from the e1071 package. The SL.glmnet wrapper was used with the lasso penalty. All tuning parameters were set to the default values of the SuperLearner package, except for SL.xgboost, which we modified to fit decision stumps rather than trees.