2020 Apr;8(4):395–406. doi: 10.1016/S2213-2600(19)30282-6

Table 2.

Characteristics of candidate whole blood transcriptional signatures for incipient tuberculosis included in systematic review and meta-analysis

Signature	Original number of genes	Model	Discovery population	Discovery HIV status	Discovery setting	Discovery approach	Intended application	Discovery tuberculosis cases	Discovery non-tuberculosis controls	Eligible signatures discovered^*
Anderson38¹⁹ ^†	42	Disease risk score^‡	Children	HIV positive and negative	South Africa, Malawi	Elastic net using genome-wide data	Tuberculosis vs latent tuberculosis infection	87	43	1
BATF2¹⁵	1	NA	Adults	HIV negative	UK	SVM using genome-wide data	Tuberculosis vs healthy (acute vs convalescent samples)	46	31	1
Gjoen7²¹	7	LASSO regression^§	Children	HIV negative	India	LASSO using 198 preselected genes	Tuberculosis vs healthy controls and other diseases	47	36	2
Gliddon3²³	3	Disease risk score^‡	Adults	HIV positive and negative	South Africa, Malawi¹⁶	Forward Selection-Partial Least Squares using genome-wide data	Tuberculosis vs latent tuberculosis infection	285 (tuberculosis and non-tuberculosis)	..	1
Huang11³¹ ^†	13	SVM (linear kernel)	Adults	HIV negative	UK²²	Common genes from elastic net, L_1/2 and LASSO models, using genome-wide data	Tuberculosis vs healthy controls and other diseases	16	79	1
Kaforou25¹⁶ ^†	27	Disease risk score^‡	Adults	HIV positive and negative	South Africa, Malawi	Elastic net using genome-wide data	Tuberculosis vs latent tuberculosis infection	285 (tuberculosis and non-tuberculosis)	..	1
Maertzdorf4¹⁸	4	Random forest^¶	Adults	HIV negative	India	Random forest using 360 selected target genes	Tuberculosis vs healthy	113	76	2
NPC2³²	1	NA	Adults	Not stated	Brazil	Differential expression using genome-wide data	Tuberculosis vs healthy	6	28	3
Qian17³³	17	Sum of standardised expression	Adults	HIV negative	UK²²	Differential expression of nuclear factor, erythroid 2-like 2-mediated genes	Tuberculosis vs healthy controls and other diseases	16	69	1
Rajan5²⁰	5	Unsigned sums^‡	Adults	HIV positive	Uganda	Differential expression using genome-wide data	Tuberculosis vs healthy (active case finding among people living with HIV)	80 total (1:2 cases:controls)	..	1
Roe3²⁶	3	SVM (linear kernel)	Adults	HIV negative	UK	Stability selection, using genome-wide data	Incipient tuberculosis vs healthy	46	31	1
Singhania20¹⁷	20	Modified disease risk score^‡^‖	Adults	HIV negative	UK, South Africa	Random forest using modular approach	Tuberculosis vs healthy controls and other diseases	Discovery set not explicitly stated	..	1
Suliman2⁷	2	ANKRD22 – OSBPL10	Adults	HIV negative	Gambia, South Africa, Ethiopia	Pair ratios algorithm using genome-wide data	Incipient tuberculosis vs healthy	79	328	4
Suliman4⁷ ^**	4	(GAS6 + SEPT4) – (CD1C + BLK)	Adults	HIV negative	Gambia, South Africa	Pair ratios algorithm using genome-wide data	Incipient tuberculosis vs healthy	45	141	4
Sweeney3¹⁴	3	(GBP5 + DUSP3) ÷ 2 – KLF2	Adults	HIV positive and negative	Meta-analysis	Significance thresholding and forward search in genome-wide data	Tuberculosis vs healthy controls and other diseases	266	931	1
Walter45³⁴ ^†	51	SVM (linear kernel)	Adults	HIV negative	USA	SVMs, using genome-wide data	Tuberculosis vs latent tuberculosis infection	24	24	1
Zak16²⁴	16	SVM (linear kernel)	Adolescents	HIV negative	South Africa	SVM-based gene pair models using genome-wide data	Incipient tuberculosis vs healthy	37	77	1

Signatures are referred to by combining the name of the first author of the corresponding publication as a prefix, with the number of constituent genes as a suffix. For signatures where not all constituent genes were identifiable in the RNA sequencing data (eg, due to records being withdrawn), the suffix indicates the number of identifiable genes included in this analysis. Log₂-transformed transcripts per million data were used to calculate all signatures, unless otherwise specified. NA=not applicable. SVM=support vector machine. LASSO=least absolute shrinkage and selection operator.

^*

Indicates the total number of eligible signatures discovered in each study. Where multiple signatures were discovered for the same intended purpose and from the same training dataset, we included the signature with the greatest accuracy, as defined by the area under the receiver operating characteristic curve in the validation data. Where accuracy was equivalent, we included the most parsimonious signature.
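The selection rule in this footnote can be illustrated with a minimal Python sketch. The candidate names, AUC values, and gene counts below are hypothetical placeholders, not data from any included study, and "equivalent" accuracy is assumed here to mean an identical reported AUC.

```python
def select_signature(candidates):
    """Pick the candidate with the highest validation AUC.

    Ties on AUC are broken in favour of the signature with the
    fewest genes (the most parsimonious one).
    """
    return max(candidates, key=lambda c: (c["auc"], -c["n_genes"]))


# Hypothetical candidates discovered in a single study (illustrative only).
candidates = [
    {"name": "sig_A", "auc": 0.80, "n_genes": 10},
    {"name": "sig_B", "auc": 0.85, "n_genes": 4},
    {"name": "sig_C", "auc": 0.85, "n_genes": 12},
]

best = select_signature(candidates)
```

Here `sig_B` and `sig_C` tie on AUC, so the smaller signature is retained.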

^†

Anderson38 included 42 genes in the original, Huang11 had 13, Kaforou25 had 27, and Walter45 had 51 (genes not included in current models were either duplicates or not identifiable in RNA sequencing data).

^‡

For disease risk scores, the sum of downregulated genes was subtracted from the sum of upregulated genes. For unsigned sums and modified disease risk scores, genes were summed, irrespective of their direction of regulation.
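As a minimal sketch of the two scoring rules described in this footnote, assuming expression values are held in a per-sample dictionary keyed by gene symbol: the gene names and values below are hypothetical placeholders, and the function names are illustrative rather than taken from the original analysis.

```python
def disease_risk_score(expr, up_genes, down_genes):
    """Sum of upregulated genes minus sum of downregulated genes."""
    up_total = sum(expr[g] for g in up_genes)
    down_total = sum(expr[g] for g in down_genes)
    return up_total - down_total


def unsigned_sum(expr, genes):
    """Sum of all genes, irrespective of direction of regulation."""
    return sum(expr[g] for g in genes)


# Hypothetical expression values for one sample (placeholder gene names).
sample = {"GENE_A": 5.0, "GENE_B": 3.0, "GENE_C": 2.0}

drs = disease_risk_score(sample, up_genes=["GENE_A", "GENE_B"], down_genes=["GENE_C"])
total = unsigned_sum(sample, ["GENE_A", "GENE_B", "GENE_C"])
```

With these placeholder values, the disease risk score is (5.0 + 3.0) minus 2.0, whereas the unsigned sum adds all three genes.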

^§

Calculated from non-log-transformed data, using model coefficients from the original publication.

^¶

Required normalisation of the training and test sets. This was done for each gene by subtracting the mean expression across all samples in the dataset and dividing by the SD.
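The per-gene standardisation described in this footnote can be sketched as follows. This is an illustration under stated assumptions: the data layout (a dictionary mapping each gene to its expression values across samples) is hypothetical, and the footnote does not say whether the sample or population SD was used, so the sample SD (n − 1 denominator) is assumed here.

```python
from statistics import mean, stdev


def standardise_genes(dataset):
    """Standardise each gene across all samples in one dataset.

    dataset maps gene name -> list of expression values (one per sample).
    Each gene needs at least two samples and non-zero variance.
    """
    standardised = {}
    for gene, values in dataset.items():
        mu = mean(values)
        sd = stdev(values)  # assumption: sample SD (n - 1 denominator)
        standardised[gene] = [(v - mu) / sd for v in values]
    return standardised


# Hypothetical dataset with one placeholder gene measured in three samples.
toy = {"GENE_A": [1.0, 2.0, 3.0]}
z = standardise_genes(toy)
```

Applying the function separately to the training and test sets matches the footnote's description of normalising each dataset across its own samples.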

^‖

Calculated using non-log-transformed counts per million data with trimmed mean of M-values normalisation, as per original description.

^**

Modelling approach was not clear from the original description. We recreated this using two approaches: as a simple equation of gene pairs ((GAS6+SEPT4)–(CD1C+BLK)) and as an SVM using the four constituent gene pairs, as previously described.³⁵ Because the former approach achieved marginally better performance that was closer to the authors' original description in their test dataset, this was included in the final analysis.
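The simple-equation form of Suliman4 described in this footnote, together with the two other equation-based models listed in the Model column (Suliman2 and Sweeney3), can be written as a short Python sketch. The expression values are hypothetical placeholders, the function names are illustrative, and the SVM alternative is not shown.

```python
def suliman4(expr):
    """(GAS6 + SEPT4) - (CD1C + BLK)."""
    return (expr["GAS6"] + expr["SEPT4"]) - (expr["CD1C"] + expr["BLK"])


def suliman2(expr):
    """ANKRD22 - OSBPL10."""
    return expr["ANKRD22"] - expr["OSBPL10"]


def sweeney3(expr):
    """(GBP5 + DUSP3) / 2 - KLF2."""
    return (expr["GBP5"] + expr["DUSP3"]) / 2 - expr["KLF2"]


# Hypothetical expression values for one sample (illustrative only).
sample = {
    "GAS6": 4.0, "SEPT4": 3.0, "CD1C": 2.0, "BLK": 1.0,
    "ANKRD22": 6.0, "OSBPL10": 2.5,
    "GBP5": 7.0, "DUSP3": 5.0, "KLF2": 4.0,
}

scores = {
    "Suliman4": suliman4(sample),
    "Suliman2": suliman2(sample),
    "Sweeney3": sweeney3(sample),
}
```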