Table 2.
Characteristics of candidate whole blood transcriptional signatures for incipient tuberculosis included in systematic review and meta-analysis
Signature | Original number of genes | Model | Discovery population | Discovery HIV status | Discovery setting | Discovery approach | Intended application | Discovery tuberculosis cases | Discovery non-tuberculosis controls | Eligible signatures discovered* |
---|---|---|---|---|---|---|---|---|---|---|
Anderson38 [19]† | 42 | Disease risk score‡ | Children | HIV positive and negative | South Africa, Malawi | Elastic net using genome-wide data | Tuberculosis vs latent tuberculosis infection | 87 | 43 | 1 |
BATF2 [15] | 1 | NA | Adults | HIV negative | UK | SVM using genome-wide data | Tuberculosis vs healthy (acute vs convalescent samples) | 46 | 31 | 1 |
Gjoen7 [21] | 7 | LASSO regression§ | Children | HIV negative | India | LASSO using 198 preselected genes | Tuberculosis vs healthy controls and other diseases | 47 | 36 | 2 |
Gliddon3 [23] | 3 | Disease risk score‡ | Adults | HIV positive and negative | South Africa, Malawi [16] | Forward Selection-Partial Least Squares using genome-wide data | Tuberculosis vs latent tuberculosis infection | 285 (tuberculosis and non-tuberculosis) | .. | 1 |
Huang11 [31]† | 13 | SVM (linear kernel) | Adults | HIV negative | UK [22] | Common genes from elastic net, L1/2 and LASSO models, using genome-wide data | Tuberculosis vs healthy controls and other diseases | 16 | 79 | 1 |
Kaforou25 [16]† | 27 | Disease risk score‡ | Adults | HIV positive and negative | South Africa, Malawi | Elastic net using genome-wide data | Tuberculosis vs latent tuberculosis infection | 285 (tuberculosis and non-tuberculosis) | .. | 1 |
Maertzdorf4 [18] | 4 | Random forest¶ | Adults | HIV negative | India | Random forest using 360 selected target genes | Tuberculosis vs healthy | 113 | 76 | 2 |
NPC2 [32] | 1 | NA | Adults | Not stated | Brazil | Differential expression using genome-wide data | Tuberculosis vs healthy | 6 | 28 | 3 |
Qian17 [33] | 17 | Sum of standardised expression | Adults | HIV negative | UK [22] | Differential expression of nuclear factor, erythroid 2-like 2-mediated genes | Tuberculosis vs healthy controls and other diseases | 16 | 69 | 1 |
Rajan5 [20] | 5 | Unsigned sums‡ | Adults | HIV positive | Uganda | Differential expression using genome-wide data | Tuberculosis vs healthy (active case finding among people living with HIV) | 80 total (1:2 cases:controls) | .. | 1 |
Roe3 [26] | 3 | SVM (linear kernel) | Adults | HIV negative | UK | Stability selection, using genome-wide data | Incipient tuberculosis vs healthy | 46 | 31 | 1 |
Singhania20 [17] | 20 | Modified disease risk score‡‖ | Adults | HIV negative | UK, South Africa | Random forest using modular approach | Tuberculosis vs healthy controls and other diseases | Discovery set not explicitly stated | .. | 1 |
Suliman2 [7] | 2 | ANKRD22 – OSBPL10 | Adults | HIV negative | Gambia, South Africa, Ethiopia | Pair ratios algorithm using genome-wide data | Incipient tuberculosis vs healthy | 79 | 328 | 4 |
Suliman4 [7]** | 4 | (GAS6 + SEPT4) – (CD1C + BLK) | Adults | HIV negative | Gambia, South Africa | Pair ratios algorithm using genome-wide data | Incipient tuberculosis vs healthy | 45 | 141 | 4 |
Sweeney3 [14] | 3 | (GBP5 + DUSP3) ÷ 2 – KLF2 | Adults | HIV positive and negative | Meta-analysis | Significance thresholding and forward search in genome-wide data | Tuberculosis vs healthy controls and other diseases | 266 | 931 | 1 |
Walter45 [34]† | 51 | SVM (linear kernel) | Adults | HIV negative | USA | SVMs, using genome-wide data | Tuberculosis vs latent tuberculosis infection | 24 | 24 | 1 |
Zak16 [24] | 16 | SVM (linear kernel) | Adolescents | HIV negative | South Africa | SVM-based gene pair models using genome-wide data | Incipient tuberculosis vs healthy | 37 | 77 | 1 |
Signatures are named by combining the first author's name from the corresponding publication (prefix) with the number of constituent genes (suffix). For signatures where not all constituent genes were identifiable in the RNA sequencing data (eg, due to records being withdrawn), the suffix indicates the number of identifiable genes included in this analysis. Log2-transformed transcripts per million data were used to calculate all signatures, unless otherwise specified. NA=not applicable. SVM=support vector machine. LASSO=least absolute shrinkage and selection operator.
*Indicates total number of eligible signatures discovered in each study. Where multiple signatures were discovered for the same intended purpose and from the same training dataset, we included the signature with greatest accuracy, as defined by the area under the receiver operating characteristic curve in the validation data. Where accuracy was equivalent, we included the most parsimonious signature.
†Anderson38 included 42 genes in the original signature, Huang11 had 13, Kaforou25 had 27, and Walter45 had 51 (genes not included in current models were either duplicates or not identifiable in RNA sequencing data).
‡For disease risk scores, the sum of downregulated genes was subtracted from the sum of upregulated genes. For unsigned sums and modified disease risk scores, genes were summed, irrespective of their direction of regulation.
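The two scoring rules described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names and expression values are hypothetical placeholders (not taken from any signature in the table), and the input is assumed to be a per-sample mapping of gene name to log2-transformed transcripts per million.

```python
def disease_risk_score(expr, up_genes, down_genes):
    """Sum of upregulated genes minus sum of downregulated genes."""
    up_total = sum(expr[g] for g in up_genes)
    down_total = sum(expr[g] for g in down_genes)
    return up_total - down_total


def unsigned_sum(expr, genes):
    """Sum of all genes, irrespective of direction of regulation."""
    return sum(expr[g] for g in genes)


# Hypothetical log2 TPM values for a single sample (illustrative only).
sample = {"GENE_A": 5.0, "GENE_B": 3.0, "GENE_C": 2.0}

drs = disease_risk_score(
    sample, up_genes=["GENE_A", "GENE_B"], down_genes=["GENE_C"]
)
total = unsigned_sum(sample, ["GENE_A", "GENE_B", "GENE_C"])
```

With these placeholder values the disease risk score is (5 + 3) − 2 and the unsigned sum is 5 + 3 + 2, so the two rules differ only in how the downregulated gene is handled.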
§Calculated from non-log-transformed data using model coefficients from the original publication.
¶Required normalisation of the training and test sets. This was done for each gene by subtracting the mean expression across all samples in the dataset and dividing by the SD.
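As a minimal sketch of this per-gene standardisation, assuming each dataset (training or test) is held as a mapping of gene name to a list of expression values across samples: the source does not state whether the sample or population SD was used, so the sample SD (n − 1 denominator) below is an assumption, and the gene names and values are hypothetical.

```python
from statistics import mean, stdev


def standardise_gene(values):
    """Subtract the gene's mean across samples and divide by its SD."""
    m = mean(values)
    sd = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - m) / sd for v in values]


def standardise_dataset(dataset):
    """Standardise every gene within one dataset, independently per gene."""
    return {gene: standardise_gene(vals) for gene, vals in dataset.items()}


# Hypothetical expression values for two genes across three samples.
dataset = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 10.0, 40.0],
}
z = standardise_dataset(dataset)
```

In this sketch each dataset is passed to `standardise_dataset` separately, so the mean and SD come from the samples in that dataset. A gene with identical values in every sample has an SD of zero and would need separate handling.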
‖Calculated using non-log-transformed counts per million data with trimmed mean of M-values normalisation, as per the original description.
**Modelling approach was not clear from the original description. We recreated this using two approaches: as a simple equation of gene pairs ((GAS6 + SEPT4) – (CD1C + BLK)) and as an SVM using the four constituent gene pairs, as previously described [35]. Because the former approach achieved marginally better performance that was closer to the authors' original description in their test dataset, this was included in the final analysis.
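The simple gene-pair equation, together with the other arithmetic models listed in the Model column, could be computed along the following lines. This is a sketch rather than the authors' code: it assumes a per-sample mapping of gene symbol to log2-transformed transcripts per million, and the expression values are hypothetical.

```python
def suliman4(expr):
    """(GAS6 + SEPT4) - (CD1C + BLK), the simple gene-pair equation."""
    return (expr["GAS6"] + expr["SEPT4"]) - (expr["CD1C"] + expr["BLK"])


def suliman2(expr):
    """ANKRD22 - OSBPL10."""
    return expr["ANKRD22"] - expr["OSBPL10"]


def sweeney3(expr):
    """(GBP5 + DUSP3) / 2 - KLF2."""
    return (expr["GBP5"] + expr["DUSP3"]) / 2 - expr["KLF2"]


# Hypothetical log2 TPM values for a single sample (illustrative only).
sample = {
    "GAS6": 4.0, "SEPT4": 3.0, "CD1C": 2.0, "BLK": 1.0,
    "ANKRD22": 5.0, "OSBPL10": 1.5,
    "GBP5": 6.0, "DUSP3": 4.0, "KLF2": 3.0,
}

scores = {
    "Suliman4": suliman4(sample),
    "Suliman2": suliman2(sample),
    "Sweeney3": sweeney3(sample),
}
```

Each function returns one score per sample. The SVM version of Suliman4 is not shown because it depends on a fitted model that is not specified here.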