2020 Apr;8(4):395–406. doi: 10.1016/S2213-2600(19)30282-6

Table 2.

Characteristics of candidate whole blood transcriptional signatures for incipient tuberculosis included in systematic review and meta-analysis

| Signature | Original number of genes | Model | Discovery population | Discovery HIV status | Discovery setting | Discovery approach | Intended application | Discovery tuberculosis cases | Discovery non-tuberculosis controls | Eligible signatures discovered* |
|---|---|---|---|---|---|---|---|---|---|---|
| Anderson38¹⁹ | 42 | Disease risk score | Children | HIV positive and negative | South Africa, Malawi | Elastic net using genome-wide data | Tuberculosis vs latent tuberculosis infection | 87 | 43 | 1 |
| BATF2¹⁵ | 1 | NA | Adults | HIV negative | UK | SVM using genome-wide data | Tuberculosis vs healthy (acute vs convalescent samples) | 46 | 31 | 1 |
| Gjoen7²¹ | 7 | LASSO regression§ | Children | HIV negative | India | LASSO using 198 preselected genes | Tuberculosis vs healthy controls and other diseases | 47 | 36 | 2 |
| Gliddon3²³ | 3 | Disease risk score | Adults | HIV positive and negative | South Africa, Malawi¹⁶ | Forward Selection-Partial Least Squares using genome-wide data | Tuberculosis vs latent tuberculosis infection | 285 (tuberculosis and non-tuberculosis) | .. | 1 |
| Huang11³¹ | 13 | SVM (linear kernel) | Adults | HIV negative | UK²² | Common genes from elastic net, L1/2 and LASSO models, using genome-wide data | Tuberculosis vs healthy controls and other diseases | 16 | 79 | 1 |
| Kaforou25¹⁶ | 27 | Disease risk score | Adults | HIV positive and negative | South Africa, Malawi | Elastic net using genome-wide data | Tuberculosis vs latent tuberculosis infection | 285 (tuberculosis and non-tuberculosis) | .. | 1 |
| Maertzdorf4¹⁸ | 4 | Random forest | Adults | HIV negative | India | Random forest using 360 selected target genes | Tuberculosis vs healthy | 113 | 76 | 2 |
| NPC2³² | 1 | NA | Adults | Not stated | Brazil | Differential expression using genome-wide data | Tuberculosis vs healthy | 6 | 28 | 3 |
| Qian17³³ | 17 | Sum of standardised expression | Adults | HIV negative | UK²² | Differential expression of nuclear factor, erythroid 2-like 2-mediated genes | Tuberculosis vs healthy controls and other diseases | 16 | 69 | 1 |
| Rajan5²⁰ | 5 | Unsigned sums | Adults | HIV positive | Uganda | Differential expression using genome-wide data | Tuberculosis vs healthy (active case finding among people living with HIV) | 80 total (1:2 cases:controls) | .. | 1 |
| Roe3²⁶ | 3 | SVM (linear kernel) | Adults | HIV negative | UK | Stability selection, using genome-wide data | Incipient tuberculosis vs healthy | 46 | 31 | 1 |
| Singhania20¹⁷ | 20 | Modified disease risk score | Adults | HIV negative | UK, South Africa | Random forest using modular approach | Tuberculosis vs healthy controls and other diseases | Discovery set not explicitly stated | .. | 1 |
| Suliman2⁷ | 2 | ANKRD22 – OSBPL10 | Adults | HIV negative | Gambia, South Africa, Ethiopia | Pair ratios algorithm using genome-wide data | Incipient tuberculosis vs healthy | 79 | 328 | 4 |
| Suliman4⁷** | 4 | (GAS6 + SEPT4) – (CD1C + BLK) | Adults | HIV negative | Gambia, South Africa | Pair ratios algorithm using genome-wide data | Incipient tuberculosis vs healthy | 45 | 141 | 4 |
| Sweeney3¹⁴ | 3 | (GBP5 + DUSP3) ÷ 2 – KLF2 | Adults | HIV positive and negative | Meta-analysis | Significance thresholding and forward search in genome-wide data | Tuberculosis vs healthy controls and other diseases | 266 | 931 | 1 |
| Walter45³⁴ | 51 | SVM (linear kernel) | Adults | HIV negative | USA | SVMs, using genome-wide data | Tuberculosis vs latent tuberculosis infection | 24 | 24 | 1 |
| Zak16²⁴ | 16 | SVM (linear kernel) | Adolescents | HIV negative | South Africa | SVM-based gene pair models using genome-wide data | Incipient tuberculosis vs healthy | 37 | 77 | 1 |

Signatures are referred to by combining the first author's name of the corresponding publication as a prefix, with number of constituent genes as a suffix. For signatures where not all constituent genes were identifiable in the RNA sequencing data (eg, due to records being withdrawn), the suffix indicates the number of identifiable genes included in this analysis. Log2-transformed transcripts per million data were used to calculate all signatures, unless otherwise specified. NA=not applicable. SVM=support vector machine. LASSO=least absolute shrinkage and selection operator.

* Indicates total number of eligible signatures discovered in each study. Where multiple signatures were discovered for the same intended purpose and from the same training dataset, we included the signature with greatest accuracy, as defined by the area under the receiver operating characteristic curve in the validation data. Where accuracy was equivalent, we included the most parsimonious signature.
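The selection rule in this footnote can be sketched in a few lines. This is an illustration only, not the authors' code: the function name `select_signature` and the tuple layout are hypothetical, and "equivalent" accuracy is assumed here to mean numerically equal AUC values.

```python
def select_signature(candidates):
    """Pick one signature among candidates that share an intended purpose
    and a training dataset.

    candidates: list of (name, validation_auc, n_genes) tuples
                (hypothetical layout, for illustration only).
    The highest validation AUC wins; ties are broken in favour of the
    signature with the fewest genes (the most parsimonious one).
    """
    # Sort key: larger AUC first, then smaller gene count.
    return max(candidates, key=lambda c: (c[1], -c[2]))[0]
```

In practice a tolerance or a statistical comparison of AUCs might be used to decide equivalence; the source does not specify which.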

Anderson38 included 42 genes in the original, Huang11 had 13, Kaforou25 had 27, and Walter45 had 51 (genes not included in current models were either duplicates or not identifiable in RNA sequencing data).

For disease risk scores, the sum of downregulated genes was subtracted from the sum of upregulated genes. For unsigned sums and modified disease risk scores, genes were summed, irrespective of their direction of regulation.
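As a minimal sketch of the two scoring rules described above, assuming each sample is held as a dictionary mapping gene name to its expression value (log2-transformed transcripts per million unless otherwise specified). The function names and the dictionary layout are assumptions made for illustration.

```python
def disease_risk_score(expr, up_genes, down_genes):
    """Sum of upregulated genes minus sum of downregulated genes."""
    up_total = sum(expr[g] for g in up_genes)
    down_total = sum(expr[g] for g in down_genes)
    return up_total - down_total


def unsigned_sum(expr, genes):
    """Sum of all constituent genes, irrespective of direction of regulation
    (used for unsigned sums and modified disease risk scores)."""
    return sum(expr[g] for g in genes)
```

For a sample with hypothetical genes `GENE_A` = 3.0 and `GENE_B` = 2.0 (upregulated) and `GENE_C` = 1.5 (downregulated), the disease risk score is 3.0 + 2.0 − 1.5 = 3.5, whereas the unsigned sum is 6.5.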

§ Calculated from non-log-transformed data, using model coefficients from the original publication.

Required normalisation of the training and test sets. This was done for each gene by subtracting the mean expression across all samples in the dataset and dividing by the SD.
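The per-gene normalisation described above amounts to a z-score across samples. A minimal sketch follows, assuming the sample SD (n − 1 denominator) is intended; the source does not state whether the sample or population SD was used, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardise_gene(values):
    """Standardise one gene's expression across all samples in a dataset.

    values: expression of a single gene, one entry per sample.
    Returns (value - mean) / SD for each sample.
    """
    m = mean(values)
    s = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - m) / s for v in values]
```

Applying this separately to the training and the test dataset gives every gene a mean of 0 and an SD of 1 within each dataset.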

Calculated using non-log-transformed counts per million data with trimmed mean of M-values normalisation, as per original description.

** Modelling approach was not clear from the original description. We recreated this using two approaches: as a simple equation of gene pairs ((GAS6+SEPT4)–(CD1C+BLK)) and as an SVM using the four constituent gene pairs, as previously described.³⁵ Because the former approach achieved marginally better performance that was closer to the authors' original description in their test dataset, this was included in the final analysis.
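The simple-equation form of Suliman4, together with the other two equation-style models listed in the Model column of the table, can be written directly. This is a sketch under stated assumptions: each sample is a dictionary of gene name to expression value (log2-transformed transcripts per million, per the table notes), and the function names are hypothetical.

```python
def suliman4(expr):
    """(GAS6 + SEPT4) - (CD1C + BLK), the simple-equation form of Suliman4."""
    return (expr["GAS6"] + expr["SEPT4"]) - (expr["CD1C"] + expr["BLK"])


def suliman2(expr):
    """ANKRD22 - OSBPL10."""
    return expr["ANKRD22"] - expr["OSBPL10"]


def sweeney3(expr):
    """(GBP5 + DUSP3) / 2 - KLF2."""
    return (expr["GBP5"] + expr["DUSP3"]) / 2 - expr["KLF2"]
```

The SVM-based recreation of Suliman4 mentioned in the footnote is not shown, because its fitted parameters are not given here.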