Journal of Clinical Microbiology. 2013 Feb;51(2):393–401. doi: 10.1128/JCM.02724-12

Diagnostic Accuracy and Reproducibility of WHO-Endorsed Phenotypic Drug Susceptibility Testing Methods for First-Line and Second-Line Antituberculosis Drugs

David J Horne a, Lancelot M Pinto b, Matthew Arentz a, S-Y Grace Lin c, Edward Desmond c, Laura L Flores d, Karen R Steingart e, Jessica Minion f
PMCID: PMC3553871  PMID: 23152548

Abstract

In an effort to update and clarify policies on tuberculosis drug susceptibility testing (DST), the World Health Organization (WHO) commissioned a systematic review evaluating WHO-endorsed diagnostic tests. We report the results of this systematic review and meta-analysis of the diagnostic accuracy and reproducibility of phenotypic DST for first-line and second-line antituberculosis drugs. This review provides support for recommended critical concentrations for isoniazid and rifampin in commercial broth-based systems. Further studies are needed to evaluate critical concentrations for ethambutol and streptomycin that accurately detect susceptibility to these drugs. Evidence is limited on the performance of DST for pyrazinamide and second-line drugs.

INTRODUCTION

The global epidemic of drug-resistant tuberculosis (DR-TB), particularly multidrug-resistant TB (MDR-TB, defined as resistance to at least isoniazid and rifampin), is one of the most serious problems facing TB care and control efforts. In their most recent worldwide survey, the World Health Organization (WHO) documented the highest rates of MDR-TB ever reported (1). Effective management of drug-resistant TB relies on multiple components, including detection, treatment, prevention, surveillance, and continuous program evaluation (2). Expanding the capacity to diagnose cases of drug-resistant TB is a priority for global TB control, requiring clear policies on the use of diagnostic tests and strengthened laboratories in which testing can be safely and effectively carried out (3).

Conventional phenotypic drug susceptibility testing (DST) using the proportion method (PM) on solid media has been well studied for isoniazid and rifampin, with a general consensus achieved regarding methodology, critical concentrations, and expected performance (4). However, the diagnostic accuracy and reproducibility of DST for other first-line and second-line drugs are inadequate (5). DST for second-line anti-TB drugs has not been standardized internationally, which is reflected in the wide variability of practices among supranational reference laboratories, underscoring the need for standardization of the methods and interpretive criteria for second-line DST (4, 6). Further complicating the lack of consensus is the increasing number of different DST methods that are available. In 2008, the WHO Stop TB Department published interim laboratory policy guidance for DST of second-line anti-TB drugs (4). Guideline recommendations for a specific DST method should ideally be based on the diagnostic test accuracy, reproducibility, ease of use, cost, and rapidity of result availability.

In an attempt to address gaps in the evidence base, WHO initiated an update to the 2008 interim guidelines with an expanded scope to include DST for all first-line and second-line drugs. As part of the update, we conducted a systematic review to determine the diagnostic accuracy and reproducibility of WHO-endorsed phenotypic DST methods and commercial genotypic DST methods for first-line and second-line anti-TB drugs. While several systematic reviews have previously evaluated specific DST methods (7–14), we assessed a variety of DST methods, focusing on studies that used WHO-defined criteria for drug resistance in the reference standard. We also collected data on test reproducibility.

(The evidence from this systematic review was presented at a WHO Expert Group meeting in March 2012. The current minireview presents results for phenotypic DST methods. Upon finalization, the full Expert Group meeting report, including results for genotypic DST methods, is to be posted on the WHO website [www.who.int/tb/laboratory/policy_statements/].)

APPROACH

We followed standard guidelines for systematic reviews of diagnostic test accuracy as recommended by the Cochrane Collaboration Diagnostic Test Accuracy Working Group, including the development of a detailed protocol prior to starting the review (15, 16).

Criteria for considering studies for this review.

We considered primary studies, regardless of study design, that evaluated a phenotypic method of DST on Mycobacterium tuberculosis. We included direct and indirect methods of DST, performed on any specimen, from all patients confirmed or suspected of having TB, from all settings and countries.

Studies were eligible for inclusion if they either (i) evaluated an index test against a defined reference standard and allowed an estimation of sensitivity, specificity, and agreement between the tests or (ii) studied the reproducibility of DST results with an index test when testing the same M. tuberculosis isolate two or more times. Acceptable reference standards, determined in consultation with WHO, were the proportion method (PM) performed on Löwenstein-Jensen (LJ), Middlebrook 7H10, or Middlebrook 7H11 agar medium and use of the Bactec 460 system (Becton, Dickinson). We excluded studies from which we were unable to extract data for true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We also excluded studies that were reported only in correspondence or conference abstracts.

Search methods for identification of studies.

We searched MEDLINE, EMBASE, and the Cochrane Library (from the earliest record dates to 30 November 2011). We restricted the search to studies published in English, French, or Spanish. If peer-reviewed systematic reviews had been performed for an index test, we updated the systematic review to include studies published since prior reviews were completed. Details of the MEDLINE search are provided in Appendix SA in the supplemental material.

Selection of studies and data extraction.

Three pairs of trained reviewers screened the titles and abstracts of identified articles for relevance. Each reviewer within a pair worked independently to identify potentially eligible studies. Any citation identified as “include” during this stage (screen 1) was selected for full-text review and assessed for study eligibility using predefined inclusion and exclusion criteria (screen 2). In screen 2, any discrepancies were resolved by a discussion between the reviewers in the pair or, if they were unable to resolve the discrepancy, by a decision of the lead reviewer (D. J. Horne). We reviewed the reference lists of select original papers and reviews to identify additional relevant studies. A list of excluded studies and their reasons for exclusion is available upon request.

Two reviewers (D. J. Horne and K. R. Steingart) developed, piloted, and refined a standardized data extraction form (see Appendix SB in the supplemental material). The same pairs of reviewers extracted data from each study.

Assessment of methodological quality.

Study quality was assessed with Quality Assessment of Diagnostic Accuracy Studies (QUADAS), version 2 (17). As recommended, all four domains in the QUADAS-2 tool (patient selection, index test, reference standard, and flow and timing) were judged as low, high, or unclear risk of bias, and the first three domains were judged for concerns of applicability. Disagreements were resolved by a third reviewer (K. R. Steingart). Appendix SC in the supplemental material describes the criteria that needed to be met for judgments concerning risk of bias or applicability for each of the QUADAS-2 domains.

Statistical analysis and data synthesis.

We adopted the following overall approach to data analysis a priori to account for the considerable heterogeneity in results expected across studies of the different anti-TB drugs and different index tests included in the review. First, we classified the following subgroups: (i) commercial broth-based tests, (ii) noncommercial solid-medium-based tests, and (iii) novel noncommercial tests. Second, within each subgroup, we separately synthesized data for each drug according to the index test and whether the reference standard used a currently recommended critical concentration. Finally, we classified studies by whether the index tests were performed as direct or indirect tests.

For each study, we determined sensitivity and specificity along with 95% confidence intervals (CIs) with exact methods using Stata/IC 11.0 (Stata Corp.) (18) and generated forest plots using Meta-DiSc (version 1.4) (19). When possible, we performed meta-analyses by drug for each index test to determine pooled performance estimates. To be included in a meta-analysis subgroup, studies were required to satisfy the following criteria: (i) availability of at least four studies using the same critical concentration for the index test and (ii) use of the currently recommended critical concentration for the reference standard. We separately pooled direct and indirect testing. We used a bivariate random effects regression model to determine summary estimates for sensitivity and specificity (16, 20), using the user-written metandi command in Stata/IC 11.0 (18). If the regression model did not fit the data and there was little discordance between the index test and the reference standard, we pooled sensitivity and specificity data separately using Meta-DiSc with the reasoning that there was little or no correlation between these measures across studies (19). Pooled agreement estimates were determined using random effects regression modeling in Stata/IC 11.0.
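The per-study calculation described above can be sketched as follows. This is a minimal illustration rather than the Stata routine used in the review: it assumes that "exact methods" refers to the Clopper-Pearson binomial interval, that a "positive" result means resistant by the reference standard, and the function names and counts are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson (exact binomial) interval for k successes in n trials."""

    def solve(cdf_k, target):
        # The binomial CDF at a fixed k decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(cdf_k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


def accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, and agreement, each with an exact 95% CI.

    Assumed convention: "positive" means resistant by the reference standard.
    """
    cells = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "agreement": (tp + tn, tp + fp + fn + tn),
    }
    return {name: (k / n, *exact_ci(k, n)) for name, (k, n) in cells.items()}


# Hypothetical counts, not taken from any included study.
for name, (est, lo, hi) in accuracy(tp=45, fp=2, fn=5, tn=98).items():
    print(f"{name}: {100 * est:.1f}% (95% CI, {100 * lo:.1f} to {100 * hi:.1f})")
```

Exact intervals stay within 0 to 100% even when a study reports no discordant results, which is common for isoniazid and rifampin in this review.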

FINDINGS

Search results.

The initial search yielded 8,464 citations (Fig. 1). After full-text review of 229 papers identified in electronic searching and 25 papers identified in the bibliography review, we identified 108 distinct papers on diagnostic accuracy and/or reproducibility of phenotypic DST methods: 45 papers evaluating DST for first-line drugs, 12 papers evaluating DST for second-line drugs, and 51 papers evaluating DST for both first-line and second-line drugs (see Appendix SD in the supplemental material). Among the 96 papers that evaluated DST for first-line drugs, 93 papers evaluated only diagnostic test accuracy; 2 papers evaluated diagnostic test accuracy and reproducibility; and 1 paper evaluated only reproducibility. Of the 63 on DST for second-line drugs, 57 papers evaluated only diagnostic test accuracy; 3 papers evaluated diagnostic test accuracy and reproducibility; and 3 papers evaluated only reproducibility. As some papers evaluated two or more index tests, drugs, and/or their critical concentrations, we identified these different evaluations (here referred to as studies) within each paper. In total, there were 405 studies containing 18,666 samples (42,926 total tests) that addressed diagnostic accuracy and 43 studies (5,670 total tests) that addressed reproducibility.

Fig 1.

Flow of studies in the review. *, a total of 45 papers evaluated DST for first-line drugs, 12 for second-line drugs, and 51 for both first- and second-line drugs.

Study characteristics.

Eleven different index tests are represented in this review (Table 1). These tests include (i) commercial broth-based systems, such as MGIT 960 (BD Diagnostic Systems, Sparks, MD), BBL Mycobacterial Growth Indicator Tube (“MGIT manual”; BD Diagnostic Systems), and VersaTREK Mycobacteria Detection & Susceptibility (Thermo Fisher Scientific, Sun Prairie, WI); (ii) noncommercial solid-medium methods, such as the resistance ratio and absolute concentration techniques on LJ media; and (iii) noncommercial novel tests, such as colorimetric redox indicator assays using alamarBlue, resazurin, or tetrazolium, the Microscopic Observation Drug Susceptibility (MODS) assay (now available in both a noncommercial and a commercial version [Hardy Diagnostics]), and the nitrate reductase assay (NRA) using either solid or liquid media. The majority of studies (354/405; 87.4%) performed indirect testing on culture isolates. Fifty-six percent of first-line drug testing and 38% of second-line drug testing occurred in low- or middle-income countries. Of the total included studies, 367 (90.6%) chose a critical concentration for the reference standard (when available) according to current recommendations. The median number of samples included in the studies was 77 (interquartile range, 45 to 132).

Table 1.

Summary of characteristics of included studies

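| Category | First-line drugs: no. of studies (290) | First-line drugs: % of total | Second-line drugs: no. of studies (115) | Second-line drugs: % of total | All drugs: no. of studies (405) | All drugs: % of total |
| --- | --- | --- | --- | --- | --- | --- |
| Drug | | | | | | |
| Isoniazid | 110 | 37.9 | | | 110 | 27.2 |
| Rifampin | 110 | 37.9 | | | 110 | 27.2 |
| Pyrazinamide | 4 | 1.4 | | | 4 | 1.0 |
| Ethambutol | 66 | 22.8 | | | 66 | 16.3 |
| Amikacin | | | 3 | 2.6 | 3 | 0.7 |
| Capreomycin | | | 7 | 6.1 | 7 | 1.7 |
| Cycloserine | | | 1 | 0.9 | 1 | 0.2 |
| Ethionamide | | | 7 | 6.1 | 7 | 1.7 |
| Gatifloxacin | | | 1 | 0.9 | 1 | 0.2 |
| Kanamycin | | | 9 | 7.8 | 9 | 2.2 |
| Linezolid | | | 2 | 1.7 | 2 | 0.5 |
| Moxifloxacin | | | 3 | 2.6 | 3 | 0.7 |
| Ofloxacin | | | 16 | 13.9 | 16 | 4.0 |
| p-Aminosalicylic acid (PAS) | | | 4 | 3.5 | 4 | 1.0 |
| Prothionamide | | | 1 | 0.9 | 1 | 0.2 |
| Streptomycin | | | 61 | 53.0 | 61 | 15.1 |
| Index test | | | | | | |
| Nitrate reductase assay, solid media | 60 | 20.7 | 21 | 18.3 | 81 | 20.0 |
| MGIT manual | 53 | 18.3 | 19 | 16.5 | 72 | 17.8 |
| MGIT 960 | 41 | 14.1 | 34 | 29.6 | 75 | 18.5 |
| Tetrazolium | 33 | 11.4 | 6 | 5.2 | 39 | 9.6 |
| Resazurin | 30 | 10.3 | 17 | 14.8 | 47 | 11.6 |
| Microscopic observation drug susceptibility | 29 | 10.0 | 5 | 4.4 | 34 | 8.4 |
| alamarBlue | 28 | 9.7 | 7 | 6.1 | 35 | 8.6 |
| VersaTREK | 6 | 2.1 | 1 | 0.9 | 7 | 1.7 |
| Nitrate reductase assay, liquid media | 5 | 1.7 | 1 | 0.9 | 6 | 1.5 |
| Löwenstein-Jensen, resistance ratio method | 3 | 1.0 | 2 | 1.7 | 5 | 1.2 |
| Löwenstein-Jensen, absolute concentration method | 2 | 0.7 | 2 | 1.7 | 4 | 1.0 |
| Type of index test | | | | | | |
| Direct | 47 | 16.2 | 4 | 3.5 | 51 | 12.6 |
| Indirect | 243 | 83.8 | 111 | 96.5 | 354 | 87.4 |
| Type of specimen | | | | | | |
| Sputum | 52 | 17.9 | 5 | 4.4 | 57 | 14.1 |
| Isolate, not otherwise specified | 238 | 82.1 | 110 | 95.7 | 348 | 85.9 |
| Location of laboratory testing by World Bank country designation | | | | | | |
| Low and middle income | 162 | 55.9 | 44 | 38.3 | 206 | 50.9 |
| High income | 121 | 41.7 | 64 | 55.7 | 185 | 45.7 |
| Both low/middle and high income | 7 | 2.4 | 7 | 6.1 | 14 | 3.5 |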

Methodological quality of included studies.

In the studies assessing DST for first-line anti-TB drugs, around 80% of studies were considered to be at high or unclear risk of bias in patient selection because these studies lacked consecutive or random selection of patients or samples, used a case-control study design, or did not report this information (see Fig. S1 in the supplemental material). For the domain concerning the index test, only 38% of studies were considered to be at low risk of bias because the index test result was interpreted blindly. For the domain concerning the reference standard, around 44% of studies were considered to be at low risk of bias because the reference standard result was interpreted blindly. Almost all studies were considered to be at low risk of bias for flow and timing since we could account for all patients in the 2-by-2 tables. We considered almost all studies to be at low concern for the three domains (patient selection, index test, and reference standard) addressed by applicability (data not shown). We found similar results for studies assessing DST for second-line anti-TB drugs: the majority of studies were considered to be at high risk of bias for patient selection, index test, and reference standard and low concern for applicability (data not shown).

Diagnostic accuracy of the index test compared with the reference standard. (i) Isoniazid.

We identified 110 studies that evaluated the diagnostic accuracy of DST for isoniazid (for a complete list, see Table S1 in the supplemental material). In general, agreement was high (>90%) for all assays. As the critical concentration of isoniazid for the novel noncommercial assays has not been established, the definition of resistance was not consistent and may have led to variations in diagnostic accuracy. For example, alamarBlue was evaluated in 12 studies with eight different critical concentrations, resazurin was evaluated in 12 studies with five different critical concentrations, and tetrazolium was evaluated in 11 studies using four different critical concentrations.

(ii) Rifampin.

We identified 110 studies that evaluated the diagnostic accuracy of DST for rifampin. Rifampin showed high agreement for all commercial broth-based tests (range, 93.1% to 100%). Agreement was moderate to high when all tests were included (range, 83.9% to 100%). There were several studies with levels of agreement of less than 90%, including colorimetric index tests (n = 2) and MODS direct testing (n = 1).

(iii) Ethambutol.

Among 66 (22.8%) studies that evaluated ethambutol DST, there was a wide range in agreement (57.1% to 100%). When studies were limited to those that used a recommended critical concentration for the reference standard (n = 57), the range remained similar. This wide range in agreement was shared by all methods that evaluated ethambutol DST and indicates relatively high variability in ethambutol DST performance.

(iv) Streptomycin.

There were 61 studies evaluating streptomycin DST. Individual studies found sensitivities ranging from 29.0% to 100% and specificities ranging from 54.5% to 100%, indicating high variability in the performance of streptomycin DST.

(v) Ofloxacin.

A total of 16 studies looked at ofloxacin DST. The majority of these found 100% accuracy; the range of sensitivities was 86.3% to 100% and the range of specificities was 85.7% to 100%.

(vi) Other antituberculosis drugs.

Concerning pyrazinamide and second-line drugs other than streptomycin and ofloxacin, the number of included studies was limited. Only one to nine studies were available for each drug, with no more than three studies evaluating comparable assays (i.e., the same index test, using the same critical concentration, evaluated in comparison to a reference standard using a recommended critical concentration) (see Table S2 in the supplemental material).

Meta-analysis.

Using only those studies that met our criteria for meta-analysis, pooled sensitivity, specificity, and agreement estimates for isoniazid, rifampin, ethambutol, streptomycin, and ofloxacin are presented in Tables 2, 3, 4, and 5. There were no subgroups of any given index test with two or more critical concentrations that met meta-analysis criteria; thus, no direct comparison between different critical concentrations could be made.
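The eligibility rule for a meta-analysis subgroup can be sketched as a simple grouping step. This is an illustrative sketch under stated assumptions: the record fields (`drug`, `index_test`, `index_cc`, `direct`, `reference_cc_recommended`) and the function name are hypothetical and are not taken from the review's data extraction form.

```python
from collections import defaultdict

MIN_STUDIES = 4  # at least four studies sharing the same index test critical concentration


def meta_analysis_subgroups(studies):
    """Group studies by drug, index test, index test CC, and direct/indirect testing.

    Studies whose reference standard did not use the recommended critical
    concentration are dropped; only groups with MIN_STUDIES or more are kept.
    """
    groups = defaultdict(list)
    for study in studies:
        if not study["reference_cc_recommended"]:
            continue
        key = (study["drug"], study["index_test"], study["index_cc"], study["direct"])
        groups[key].append(study)
    return {key: members for key, members in groups.items() if len(members) >= MIN_STUDIES}


# Hypothetical records for illustration only.
example = [
    {"drug": "isoniazid", "index_test": "MGIT 960", "index_cc": 0.1,
     "direct": False, "reference_cc_recommended": True}
    for _ in range(4)
] + [
    {"drug": "isoniazid", "index_test": "resazurin", "index_cc": 0.25,
     "direct": False, "reference_cc_recommended": True}
    for _ in range(3)
]
print(list(meta_analysis_subgroups(example)))
```

Because the critical concentration is part of the grouping key, two concentrations of the same index test can be compared only if each one independently reaches four studies, which did not occur in this review.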

Table 2.

Isoniazid DST meta-analyses: pooled estimates of sensitivity, specificity, and agreement by test and type of specimen^a

| Isoniazid test method | Indirect DST: no. of studies (n) | Indirect DST: % sensitivity (95% CI) | Indirect DST: % specificity (95% CI) | Indirect DST: % agreement (95% CI) | Direct DST: no. of studies (n) | Direct DST: % sensitivity (95% CI) | Direct DST: % specificity (95% CI) | Direct DST: % agreement (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Commercial | | | | | | | | |
| MGIT manual (CC = 0.1 μg/ml) | 15 (1,339) | 97.1 (92.7–98.9) | 97.6 (94.5–99.0) | 97.3 (95.7–98.8) | 1 (101) | | | |
| MGIT 960 (CC = 0.1 μg/ml) | 10 (811) | 98.9 (94.4–99.8) | 98.2 (95.4–99.3) | 98.7 (97.7–99.7) | | | | |
| Noncommercial | | | | | | | | |
| Microscopic Observation Drug Susceptibility (MODS) assay^b (CC = 0.1 μg/ml) | 2 (84) | | | | 4 (691) | 94.4 (90.1–96.9) | 91.8 (82.9–96.2) | 92.9 (88.9–96.8) |
| NRA, solid media (CC = 0.2 μg/ml) | 13 (1,587) | 96.8 (94.6–98.1) | 100 (95.5–100) | 99.2 (98.6–99.8) | 8 (934) | 97.2 (94.1–98.7) | 98.5 (96.2–99.4) | 98.4 (97.3–99.5) |
| alamarBlue (CC = 0.2 or 0.25 μg/ml) | 4 (263) | 92.8 (80.2–97.6) | 97.1 (85.5–99.5) | 95.6 (90.4–100) | | | | |
| Resazurin (CC = 0.25 μg/ml) | 7 (603) | 99.8 (85.9–100) | 98.6 (95.8–99.6) | 99.5 (98.6–100) | | | | |
| Tetrazolium (CC = 0.2 or 0.25 μg/ml) | 9 (748) | 98.2 (92.7–99.6) | 98.8 (96.9–99.5) | 99.1 (98.1–100) | | | | |

^a The models used in determining diagnostic test accuracy are called “hierarchical models” because they involve statistical distributions at two levels. At the first level, they model the values in cells of the 2-by-2 tables extracted from each study using binomial distributions and logistic (log-odds) transformations of proportions. At the second (higher) level, the models assume random study effects to account for heterogeneity in diagnostic test accuracy between studies beyond that accounted for by sampling variability at the lower level. In cases where it was not possible to summarize the data with the hierarchical model, we pooled sensitivity and specificity estimates separately. CC, critical concentration; NRA, nitrate reductase assay.

^b MODS is now available in both noncommercial and commercial versions (Hardy Diagnostics).
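The two-level structure described in the table footnote can be written out explicitly. The block below is a sketch of the commonly used bivariate random effects formulation, not a specification taken from the review. The symbols are assumptions introduced here: study index i, n_{1i} and n_{0i} for the numbers of reference-resistant and reference-susceptible samples in study i, and a bivariate normal distribution for the study-level logits.

```latex
\begin{align*}
\text{Level 1:}\quad
  & \mathrm{TP}_i \sim \mathrm{Binomial}(n_{1i},\, \mathrm{Se}_i), \qquad
    \mathrm{TN}_i \sim \mathrm{Binomial}(n_{0i},\, \mathrm{Sp}_i) \\
\text{Level 2:}\quad
  & \begin{pmatrix} \operatorname{logit}\mathrm{Se}_i \\ \operatorname{logit}\mathrm{Sp}_i \end{pmatrix}
    \sim \mathcal{N}\!\left(
      \begin{pmatrix} \mu_{\mathrm{Se}} \\ \mu_{\mathrm{Sp}} \end{pmatrix},
      \begin{pmatrix}
        \sigma_{\mathrm{Se}}^{2} & \sigma_{\mathrm{Se,Sp}} \\
        \sigma_{\mathrm{Se,Sp}} & \sigma_{\mathrm{Sp}}^{2}
      \end{pmatrix}
    \right) \\
\text{Summary:}\quad
  & \widehat{\mathrm{Se}} = \operatorname{logit}^{-1}(\mu_{\mathrm{Se}}), \qquad
    \widehat{\mathrm{Sp}} = \operatorname{logit}^{-1}(\mu_{\mathrm{Sp}})
\end{align*}
```

The off-diagonal term allows sensitivity and specificity to be correlated across studies. Pooling the two measures separately amounts to treating that covariance as zero.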

Table 3.

Rifampin DST meta-analyses: pooled estimates of sensitivity, specificity, and agreement by test and type of specimen^a

| Rifampin test method | Indirect DST: no. of studies (n) | Indirect DST: % sensitivity (95% CI) | Indirect DST: % specificity (95% CI) | Indirect DST: % agreement (95% CI) | Direct DST: no. of studies (n) | Direct DST: % sensitivity (95% CI) | Direct DST: % specificity (95% CI) | Direct DST: % agreement (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Commercial | | | | | | | | |
| MGIT manual (CC = 1 μg/ml) | 16 (1,447) | 95.0 (89.2–97.8) | 100 (95.1–100) | 99.0 (98.3–99.7) | 1 (101) | | | |
| MGIT 960 (CC = 1 μg/ml) | 10 (800) | 98.2 (92.8–99.6) | 99.6 (98.5–99.9) | 99.5 (98.6–100) | 1 (222) | | | |
| Noncommercial | | | | | | | | |
| Microscopic Observation Drug Susceptibility (MODS) assay^b (CC = 1 μg/ml) | 2 (95) | | | | 5 (823) | 97.9 (85.3–99.7) | 98.8 (90.8–99.8) | 97.5 (94.9–100) |
| NRA, solid media (CC = 40 μg/ml) | 15 (1,665) | 98.2 (95.4–99.3) | 99.9 (97.6–100) | 99.5 (99.0–100) | 9 (1,200) | 96.3^c (93.6–98.1) | 99.5^c (98.8–99.9) | 99.4 (98.7–100) |
| alamarBlue (CC = 0.25 μg/ml) | 4 (166) | 98.0^c (89.4–99.9) | 98.3^c (93.9–99.8) | 98.5 (95.7–100) | | | | |
| Resazurin (CC = 0.5 μg/ml) | 6 (504) | 99.6 (82.2–100) | 99.5 (97.6–99.9) | 99.4 (98.4–100) | | | | |
| Tetrazolium (CC = 1 μg/ml) | 5 (544) | 92.5 (78.2–97.7) | 99.9 (74.8–100) | 98.9 (96.8–100) | | | | |

^a The models used in determining diagnostic test accuracy are called “hierarchical models” because they involve statistical distributions at two levels. At the first level, they model the values in cells of the 2-by-2 tables extracted from each study using binomial distributions and logistic (log-odds) transformations of proportions. At the second (higher) level, the models assume random study effects to account for heterogeneity in diagnostic test accuracy between studies beyond that accounted for by sampling variability at the lower level. CC, critical concentration; NRA, nitrate reductase assay.

^b MODS is now available in both noncommercial and commercial versions (Hardy Diagnostics).

^c In cases where it was not possible to summarize the data with the hierarchical model, we pooled sensitivity and specificity estimates separately.

Table 4.

Ethambutol DST meta-analyses: pooled estimates of sensitivity, specificity, and agreement by test and type of specimen^a

| Ethambutol test method | Indirect DST: no. of studies (total n) | Indirect DST: % sensitivity (95% CI) | Indirect DST: % specificity (95% CI) | Indirect DST: % agreement (95% CI) | Direct DST: no. of studies (total n) | Direct DST: % sensitivity (95% CI) | Direct DST: % specificity (95% CI) | Direct DST: % agreement (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Commercial | | | | | | | | |
| MGIT manual (CC = 3.5 μg/ml) | 9 (890) | 83.3 (42.0–97.2) | 96.3 (91.2–98.5) | 93.1 (89.8–96.5) | | | | |
| MGIT 960 (CC = 5 μg/ml) | 7 (647) | 83.9 (72.7–91.1) | 95.8 (80.9–99.2) | 95.3 (92.5–98.0) | 1 (222) | | | |
| Noncommercial | | | | | | | | |
| NRA, solid media (CC = 2 μg/ml) | 10 (1,216) | 94.3 (89.0–97.1) | 99.0 (95.8–99.8) | 97.8 (96.5–100) | 1 (47) | | | |
| Tetrazolium (CC = 4 μg/ml) | 4 (215) | 91.9 (77.3–97.4) | 91.6 (75.5–97.5) | 89.6 (79.5–99.7) | | | | |

^a The models used in determining diagnostic test accuracy are called “hierarchical models” because they involve statistical distributions at two levels. At the first level, they model the values in cells of the 2-by-2 tables extracted from each study using binomial distributions and logistic (log-odds) transformations of proportions. At the second (higher) level, the models assume random study effects to account for heterogeneity in diagnostic test accuracy between studies beyond that accounted for by sampling variability at the lower level. CC, critical concentration; NRA, nitrate reductase assay.

Table 5.

Second-line DST meta-analyses: pooled estimates of sensitivity, specificity, and agreement by test and type of specimen^a

| Drug and test method | Indirect DST: no. of studies | Indirect DST: % sensitivity (95% CI) | Indirect DST: % specificity (95% CI) | Indirect DST: % agreement (95% CI) | Direct DST: no. of studies | Direct DST: % sensitivity (95% CI) | Direct DST: % specificity (95% CI) | Direct DST: % agreement (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Streptomycin, commercial | | | | | | | | |
| MGIT manual (CC = 0.8 μg/ml) | 10 | 94.1 (81.9–98.3) | 94.6 (88.3–97.3) | 93.0 (89.5–96.6) | | | | |
| MGIT 960 (CC = 1 μg/ml) | 7 | 99.7 (74.3–100) | 94.3 (76.7–98.8) | 96.4 (94.0–98.8) | | | | |
| Streptomycin, noncommercial | | | | | | | | |
| NRA, solid media (CC = 4 μg/ml) | 10 | 91.2 (81.7–96.1) | 97.5 (92.2–99.3) | 94.6 (91.9–97.4) | | | | |
| Tetrazolium (CC = 1 μg/ml) | 4 | 90.8^b (82.7–95.9) | 91.4^b (85.1–95.6) | 92.9 (87.4–98.5) | | | | |
| Ofloxacin, commercial | | | | | | | | |
| MGIT 960 (CC = 2 μg/ml) | 4 | 99.2 (76.4–100) | 99.9 (95.8–100) | 100 (99.8–100) | | | | |
| Ofloxacin, noncommercial | | | | | | | | |
| Resazurin (CC = 2 μg/ml) | 4 | 100^b (91.6–100) | 100^b (99.0–100) | 100 (99.2–100) | | | | |

^a The models used in determining diagnostic test accuracy are called “hierarchical models” because they involve statistical distributions at two levels. At the first level, they model the values in cells of the 2-by-2 tables extracted from each study using binomial distributions and logistic (log-odds) transformations of proportions. At the second (higher) level, the models assume random study effects to account for heterogeneity in diagnostic test accuracy between studies beyond that accounted for by sampling variability at the lower level. CC, critical concentration; NRA, nitrate reductase assay.

^b Data represent values calculated using a univariate random effects logistic regression model because the bivariate model did not converge.

Isoniazid testing by either MGIT manual or MGIT 960 yielded high pooled estimates of agreement (97.3% [95% CI, 95.7 to 98.8] and 98.7% [95% CI, 97.7 to 99.7], respectively) (Table 2). Two of the three colorimetric assays, using resazurin and tetrazolium and a critical concentration of 0.2 or 0.25 μg/ml, had high sensitivity and specificity; the sensitivity of alamarBlue was lower at 92.8% (95% CI, 80.2 to 97.6). The NRA on solid media had high accuracy using both indirect and direct methods. MODS using direct methods had a pooled agreement of 92.9% (95% CI, 88.9 to 96.8).

For rifampin susceptibility testing using commercial broth-based and noncommercial novel methods, pooled specificity estimates were all >98%, as were most pooled sensitivity estimates (Table 3). Exceptions to this included MGIT manual (sensitivity, 95.0% [95% CI, 89.2 to 97.8]), tetrazolium (sensitivity, 92.5% [95% CI, 78.2 to 97.7]), and direct testing of both MODS and NRA-solid media (respective sensitivities, 97.9% [95% CI, 85.3 to 99.7] and 96.3% [95% CI, 93.6 to 98.1]).

Ethambutol DST using commercial broth tests at the recommended critical concentrations of 3.5 μg/ml for MGIT manual and 5 μg/ml for MGIT 960 showed relatively low sensitivities: 83.3% (95% CI, 42.0 to 97.2) and 83.9% (95% CI, 72.7 to 91.1), respectively. In comparison, both pooled sensitivity and specificity estimates for NRA performed on solid media (using a critical concentration of 2 μg/ml) exceeded those of the commercial liquid tests (Table 4).

Streptomycin testing by MGIT 960 (critical concentration, 1 μg/ml) yielded a high pooled sensitivity estimate (99.7% [95% CI, 74.3 to 100]) but only moderate specificity (94.3% [95% CI, 76.7 to 98.8]). In comparison, MGIT manual (critical concentration, 0.8 μg/ml) was found to have a lower pooled sensitivity (94.1% [95% CI, 81.9 to 98.3]) with a similar pooled specificity (Table 5). Meta-analyzable subgroups were available for MGIT 960 and resazurin testing for ofloxacin resistance, both of which showed excellent accuracy (Table 5).

Reproducibility.

A limited number of studies were included in our systematic review on reproducibility. For first-line drugs, three papers contributed 24 studies for the data on test reproducibility (see Table S3 in the supplemental material) (21–23). Five studies assessed the intralaboratory and three studies the interlaboratory reproducibility of isoniazid DST using 7H10, LJ, MGIT 960, and alamarBlue. The agreement within and between sites was high (90.0% to 100%). Rifampin reproducibility, evaluated using the same tests as isoniazid, was excellent within sites (98.7% to 100%, 5 studies) and between sites (95.0% to 99.2%, 3 studies). Eight studies evaluated ethambutol reproducibility. Agreement for studies that evaluated conventional DST was high: 97.1% to 98.8% using the 7H10 PM (critical concentration, 5.0 μg/ml; 2 studies) and 100% using the LJ resistance ratio method (1 study). In the single paper that evaluated MGIT 960, agreement was 92.2% for intralaboratory and 97.1% for interlaboratory evaluations. There was only moderate agreement (80.3%, 1 study) for alamarBlue for within-laboratory reproducibility.

For second-line drugs, five papers were identified that contributed 19 studies on test reproducibility (see Table S3 in the supplemental material) (21–25). There were only one or two studies for each second-line drug, except for streptomycin (8 studies). For streptomycin, MGIT 960 (critical concentration, 1.0 μg/ml; 2 studies) and the 7H10 proportion method (critical concentration, 2.0 μg/ml; 2 studies) yielded agreement of 92.2% to 96.2% and 91.3% to 94.0%, respectively.

CONCLUSIONS

We performed a systematic review and meta-analysis of the diagnostic accuracy of drug susceptibility testing for first-line and second-line antituberculosis medications. We also collected data on the reproducibility of included tests. Our findings agree with the literature in a number of ways, including support for the excellent performance of commercial broth tests in DST for isoniazid and rifampin. In addition, we found high performance of MODS and NRAs in DST for isoniazid and rifampin resistance.

We identified the following key points.

  1. We found poor and variable agreement for ethambutol DST across all methods, including MGIT 960 at the currently recommended critical concentration of 5 μg/ml. This poor agreement appears to be driven by the moderate sensitivity of the tests in detecting ethambutol resistance. Although there may be a number of explanations for this poor performance (e.g., an issue with the reference standard “over-calling” resistance), a reevaluation of the currently recommended ethambutol critical concentrations for the index tests studied may be warranted. This recommendation is supported by a recent study published by the U.S. Centers for Disease Control and Prevention (CDC), based on results of a proficiency testing program in which cultures with known susceptibility or resistance were sent by the CDC to participating laboratories over a 15-year period (26). The authors of that study, Angra and colleagues, concluded that ethambutol testing with MGIT is not equivalent to testing at the critical concentrations used for the agar proportion methods or Bactec 460, and they recommended evaluation of a different test concentration for ethambutol in MGIT.

  2. The number of studies evaluating the reproducibility of DST for first-line and second-line drugs is limited.

  3. The colorimetric test results should be interpreted with caution. Several of the included studies either did not determine index test critical concentrations in advance or evaluated a range of concentrations that may have been a source of bias.

  4. The evaluation of the best-performing critical concentration of a test was not the primary objective of this review; aside from ethambutol, we are unable to make recommendations regarding evaluations of the currently recommended critical concentrations.

This review had several strengths, including a broad search strategy and inclusion of papers in English, French, and Spanish. All phases of the review, including screening of citations, full-text review, and data extraction, were done by at least two trained reviewers working independently. In addition, we used rigorous statistical methods. To determine pooled accuracy estimates, we used random effects modeling, which provides more conservative estimates than fixed effects modeling when heterogeneity is present.
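The contrast between fixed effects and random effects pooling can be illustrated with a short sketch. It uses the DerSimonian-Laird moment estimator on the logit scale, which is a deliberately simpler technique than the regression models fitted in Stata for this review. The function name, the 0.5 continuity correction, and the counts are assumptions for illustration, and at least two studies are assumed.

```python
from math import exp, log, sqrt


def pool_logit(events, totals):
    """Pool proportions on the logit scale with fixed and random effects weights."""
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5  # continuity correction keeps logits finite
        y.append(log(a / b))
        v.append(1 / a + 1 / b)

    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

    # DerSimonian-Laird moment estimate of the between-study variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)

    w_re = [1 / (vi + tau2) for vi in v]
    pooled_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)

    def summary(center, weights):
        se = sqrt(1 / sum(weights))
        to_p = lambda x: 1 / (1 + exp(-x))
        return {
            "estimate": to_p(center),
            "ci": (to_p(center - 1.96 * se), to_p(center + 1.96 * se)),
            "se_logit": se,
        }

    return {"tau2": tau2, "fixed": summary(fixed, w), "random": summary(pooled_re, w_re)}


# Hypothetical heterogeneous studies: agreeing results / samples tested.
result = pool_logit(events=[48, 30, 95, 20], totals=[50, 40, 100, 35])
print(result["tau2"], result["fixed"]["ci"], result["random"]["ci"])
```

When the estimated between-study variance is greater than zero, every study's weight shrinks and the standard error of the pooled logit grows, which is why the random effects interval is the more conservative of the two.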

This review was limited by the lack of consensus regarding the “gold standard” for DST. For example, WHO and CDC recommendations regarding the preferred solid media for first-line DST have varied (4, 27), and many experts in the field consider commercial broth systems to be equivalent to, or more reliable than, those using solid media. In keeping with the search criteria developed with WHO, we did not accept genotypic tests (commercial or “in-house”) as the reference standard or as confirmatory tests for discrepant results. Two recent studies have raised concerns about phenotypic DST methods for rifampin using the recommended critical concentrations (28, 29). Van Deun and colleagues reported that certain conventional drug susceptibility methods did not detect strains of M. tuberculosis that have rpoB mutations associated with small increases in the drug MIC (28). Furthermore, using the Xpert MTB/RIF assay and gene sequencing, Williamson and colleagues identified four patients (three with clinical information suggestive of resistance) whose M. tuberculosis isolates contained mutations in the rpoB gene but appeared to be rifampin susceptible using phenotypic methods (29). In the Williamson study, the presence of these mutations was associated with diminished treatment effectiveness or treatment failure. Our findings should be cautiously interpreted in light of these emerging data. Further clinical observations are required before the significance of rpoB mutations associated with “low level” resistance can be known.

The review was also limited by the small number of studies in various subgroups, which precluded determination of pooled accuracy estimates. Where data were not reported in papers, we did not contact authors and considered this information not reported (missing). An additional concern is that analyses were often performed on a “per sample” basis, meaning that results could be affected in cases in which several samples were taken from a patient. In addition, we may have missed papers that were published in languages that were not included in the review. Finally, interpretation of the data is limited in that the majority of studies either used a case-control design or did not report the study design. Ideally, diagnostic studies should employ consecutive or random sampling of eligible patients with the suspected disease or condition to limit the potential for bias, although we acknowledge that this is difficult for the evaluation of drug resistance where resistance rates are low. We did not formally assess publication bias using methods such as funnel plots or regression tests because such techniques are not recommended for diagnostic studies (16, 30).

It should be noted that one of the techniques considered to be a reference method in this study, the Bactec radiometric method using 12B medium and the 460TB instrument, is no longer available. This method was well regarded and played a critical role in the development of currently recommended critical concentrations. Due to its wide acceptance as an accurate method for DST, we included studies that used 12B radiometric medium as the reference standard in evaluating the accuracy of newer methods, such as MGIT medium, MODS, and colorimetric redox methods. Future diagnostic evaluations are unlikely to have access to this method.

In summary, this systematic review provides support for the critical concentrations recommended for commercial broth systems for DST for isoniazid and rifampin. Further studies are needed to evaluate critical concentrations for ethambutol and streptomycin that would accurately detect significant changes in susceptibility to these drugs. Evidence is limited on the performance of drug susceptibility testing for pyrazinamide and most second-line drugs.

Supplementary Material

Supplemental material

ACKNOWLEDGMENTS

We thank Sherry Dodson and Yuki Durham (University of Washington) and Vittoria Lutje (Cochrane Infectious Diseases Group and Liverpool School of Medicine) for assistance with literature searching.

This work was supported in part by the World Health Organization Stop TB Department.

WHO had no role in the data collection, data analysis, data interpretation, or writing of the manuscript. We had full access to the data and are solely responsible for the decision to submit these results for publication.

Footnotes

Published ahead of print 14 November 2012

Supplemental material for this article may be found at http://dx.doi.org/10.1128/JCM.02724-12.

REFERENCES

  1. Zignol M, van Gemert W, Falzon D, Sismanidis C, Glaziou P, Floyd K, Raviglione M. 2012. Surveillance of anti-tuberculosis drug resistance in the world: an updated analysis, 2007–2010. Bull. World Health Organ. 90:111–119D
  2. World Health Organization 2011. Guidelines for the programmatic management of drug-resistant tuberculosis – 2011 update. World Health Organization, Geneva, Switzerland: http://whqlibdoc.who.int/publications/2011/9789241501583_eng.pdf Accessed 25 February 2012
  3. World Health Organization 2011. Global tuberculosis control: WHO report 2011. WHO/HTM/TB/2010.7. World Health Organization, Geneva, Switzerland: http://www.who.int/tb/publications/global_report/2011/gtbr11_full.pdf Accessed 25 February 2012
  4. World Health Organization 2008. World Health Organization Policy guidance on drug-susceptibility testing (DST) of second-line antituberculosis drugs. WHO/HTM/TB/2008.392. World Health Organization, Geneva, Switzerland: http://whqlibdoc.who.int/hq/2008/WHO_HTM_TB_2008.392_eng.pdf Accessed 25 February 2012
  5. Kim SJ. 2005. Drug-susceptibility testing in tuberculosis: methods and reliability of results. Eur. Respir. J. 25:564–569
  6. Kim SJ, Espinal MA, Abe C, Bai GH, Boulahbal F, Fattorini L, Gilpin C, Hoffner S, Kam KM, Martin-Casabona N, Rigouts L, Vincent V. 2004. Is second-line anti-tuberculosis drug susceptibility testing reliable? Int. J. Tuberc. Lung Dis. 8:1157–1158
  7. Bwanga F, Hoffner S, Haile M, Joloba ML. 2009. Direct susceptibility testing for multi drug resistant tuberculosis: a meta-analysis. BMC Infect. Dis. 9:67 doi:10.1186/1471-2334-9-67
  8. Chang KC, Yew WW, Chan RC. 2010. Rapid assays for fluoroquinolone resistance in Mycobacterium tuberculosis: a systematic review and meta-analysis. J. Antimicrob. Chemother. 65:1551–1561
  9. Chang KC, Yew WW, Zhang Y. 2009. A systematic review of rapid drug susceptibility tests for multidrug-resistant tuberculosis using rifampin resistance as a surrogate. Expert Opin. Med. Diagn. 3:99–122
  10. Ling DI, Zwerling AA, Pai M. 2008. GenoType MTBDR assays for the diagnosis of multidrug-resistant tuberculosis: a meta-analysis. Eur. Respir. J. 32:1165–1174
  11. Martin A, Panaiotov S, Portaels F, Hoffner S, Palomino JC, Angeby K. 2008. The nitrate reductase assay for the rapid detection of isoniazid and rifampicin resistance in Mycobacterium tuberculosis: a systematic review and meta-analysis. J. Antimicrob. Chemother. 62:56–64
  12. Martin A, Portaels F, Palomino JC. 2007. Colorimetric redox-indicator methods for the rapid detection of multidrug resistance in Mycobacterium tuberculosis: a systematic review and meta-analysis. J. Antimicrob. Chemother. 59:175–183
  13. Minion J, Leung E, Menzies D, Pai M. 2010. Microscopic-observation drug susceptibility and thin layer agar assays for the detection of drug resistant tuberculosis: a systematic review and meta-analysis. Lancet Infect. Dis. 10:688–698
  14. Morgan M, Kalantri S, Flores L, Pai M. 2005. A commercial line probe assay for the rapid detection of rifampicin resistance in Mycobacterium tuberculosis: a systematic review and meta-analysis. BMC Infect. Dis. 5:62 doi:10.1186/1471-2334-5-62
  15. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM. 2008. Systematic reviews of diagnostic test accuracy. Ann. Intern. Med. 149:889–897
  16. Macaskill P, Gatsonis C, Deeks JJ, Harbord RM, Takwoingi Y. 2010. Chapter 10: analysing and presenting results. In Deeks JJ, Bossuyt PM, Gatsonis C. (ed), Cochrane handbook for systematic reviews of diagnostic test accuracy, version 090. The Cochrane Collaboration, Birmingham, England: http://srdta.cochrane.org/ Accessed 6 March 2011
  17. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM. 2011. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155:529–536
  18. StataCorp 2009. Stata Statistical Software: release 11. StataCorp, College Station, TX
  19. Zamora J, Abraira V, Muriel A, Khan KS, Coomarasamy A. 2006. Meta-DiSc: a software for meta-analysis of test accuracy data. BMC Med. Res. Methodol. 6:31 doi:10.1186/1471-2288-6-31
  20. Reitsma JB, Glas AS, Rutjes AW, Scholten RJ, Bossuyt PM, Zwinderman AH. 2005. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 58:982–990
  21. Giampaglia CM, Martins MC, Vieira GB, Vinhas SA, Telles MA, Palaci M, Marsico AG, Hadad DJ, Mello FC, Fonseca Lde S, Kritski A. 2007. Multicentre evaluation of an automated BACTEC 960 system for susceptibility testing of Mycobacterium tuberculosis. Int. J. Tuberc. Lung Dis. 11:986–991
  22. Laszlo A, Helbecque DM, Tostowaryk W. 1987. Proficiency testing of conventional drug susceptibility tests of Mycobacterium tuberculosis. Can. J. Microbiol. 33:1064–1068
  23. Leonard B, Coronel J, Siedner M, Grandjean L, Caviedes L, Navarro P, Gilman RH, Moore DA. 2008. Inter- and intra-assay reproducibility of microplate Alamar blue assay results for isoniazid, rifampicin, ethambutol, streptomycin, ciprofloxacin, and capreomycin drug susceptibility testing of Mycobacterium tuberculosis. J. Clin. Microbiol. 46:3526–3529
  24. Lin SY, Desmond E, Bonato D, Gross W, Siddiqi S. 2009. Multicenter evaluation of Bactec MGIT 960 system for second-line drug susceptibility testing of Mycobacterium tuberculosis complex. J. Clin. Microbiol. 47:3630–3634
  25. Rüsch-Gerdes S, Pfyffer GE, Casal M, Chadwick M, Siddiqi S. 2006. Multicenter laboratory validation of the BACTEC MGIT 960 technique for testing susceptibilities of Mycobacterium tuberculosis to classical second-line drugs and newer antimicrobials. J. Clin. Microbiol. 44:688–692
  26. Angra PK, Taylor TH, Iademarco MF, Metchock B, Astles JR, Ridderhof JC. 2012. Performance of tuberculosis drug susceptibility testing in U.S. laboratories from 1994 to 2008. J. Clin. Microbiol. 50:1233–1239
  27. Kent PT, Kubica GP. 1985. Public health mycobacteriology: a guide for the level III laboratory. U.S. Department of Health and Human Services, Atlanta, GA
  28. Van Deun A, Barrera L, Bastian I, Fattorini L, Hoffmann H, Kam KM, Rigouts L, Rüsch-Gerdes S, Wright A. 2009. Mycobacterium tuberculosis strains with highly discordant rifampin susceptibility test results. J. Clin. Microbiol. 47:3501–3506
  29. Williamson DA, Roberts SA, Bower JE, Vaughan R, Newton S, Lowe O, Lewis CA, Freeman JT. 2012. Clinical failures associated with rpoB mutations in phenotypically occult multidrug-resistant Mycobacterium tuberculosis. Int. J. Tuberc. Lung Dis. 16:216–220
  30. Tatsioni A, Zarin DA, Aronson N, Samson DJ, Flamm CR, Schmid C, Lau J. 2005. Challenges in systematic reviews of diagnostic technologies. Ann. Intern. Med. 142(12 Pt 2):1048–1055


Articles from Journal of Clinical Microbiology are provided here courtesy of American Society for Microbiology (ASM)
