Abstract
Context:
The appropriate role of direct total testosterone (T) immunoassays in reproductive research is controversial.
Objective:
To assess the concordance between two direct immunoassays and a liquid chromatography-tandem mass spectrometry (LC-MS/MS) assay for total T in adolescent girls with measured concentrations < 50 ng/dl.
Design:
Cross-sectional analysis.
Setting:
Academic medical center.
Participants:
Adolescent girls (age 8.4–18.1 years) participating in clinical research protocols.
Intervention:
Paired blood samples were obtained for total T by LC-MS/MS (n = 66; Mayo Clinic Laboratory) and by direct immunoassay (Center for Research in Reproduction)—either radioimmunoassay (RIA; n = 31) or chemiluminescence immunoassay (CLIA; n = 35). At the time of assay, laboratories were unaware that results would be compared.
Main Outcome Measure:
Measurement agreement between immunoassay and LC-MS/MS.
Results:
Measured T concentrations (LC-MS/MS) were < 7 to 44 ng/dl. The average difference between RIA and LC-MS/MS was 0.84 [−0.89, 2.56] ng/dl (mean [95% confidence interval]). RIA correlated very strongly with LC-MS/MS (r = 0.899; p < 0.0001), and both Deming regression and Bland-Altman analysis suggested no bias. The average difference between CLIA and LC-MS/MS was 1.39 [−0.83, 3.60] ng/dl. CLIA correlated strongly with LC-MS/MS (r = 0.806; p < 0.0001). While Bland-Altman analysis suggested no systematic bias, Deming regression analysis suggested that, as measured values increased, values obtained by CLIA tended to be progressively, albeit only modestly, higher than those obtained by LC-MS/MS.
Conclusions:
These data support the use of rigorously performed and carefully validated direct T immunoassays in high-quality endocrine research in peripubertal adolescent girls.
Keywords: Testosterone, immunoassay, radioimmunoassay, chemiluminescence, liquid chromatography-tandem mass spectrometry
Introduction
Scientific rigor in endocrine research mandates the use of reliable assay techniques. The role of direct sex steroid immunoassays in reproductive research has been a topic of considerable debate in recent years. For example, many direct immunoassays for total testosterone (T) demonstrate poor accuracy at low concentrations (i.e., concentrations typically observed in women and children) and may exhibit bias (overestimation) related to poor specificity (1, 2). For these reasons, some have asserted that direct immunoassays for sex steroids, including total T, are unfit for use in high-quality clinical research (3). However, not all sex steroid immunoassays are created equal, nor are they deployed with equal rigor. Moreover, although mass spectrometry (MS)-based sex steroid assays generally demonstrate several advantages over direct immunoassay, including greater precision and specificity, MS-based assays are more complex to perform, significantly more costly, and of limited availability within a clinical research setting (4).
The Ligand Assay and Analysis Core of the Center for Research in Reproduction (LAAC-CRR; University of Virginia) is a research laboratory supported by the Fertility and Infertility Branch (FIB) of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). The LAAC-CRR performs hormone assays for numerous FIB-supported investigators, including the Reproductive Medicine Network (RMN). In a 2010 study employing serum samples obtained from women with the polycystic ovary syndrome (PCOS), members of the RMN directly compared total T results obtained from two different liquid chromatography-tandem mass spectrometry (LC-MS/MS) assays—Mayo Clinic Laboratory and Quest Diagnostics—with results obtained via direct radioimmunoassay (RIA) performed in the LAAC-CRR. In brief, median estimated T concentrations were slightly higher with RIA—50 vs. 47 (Mayo) and 41 (Quest) ng/dl—but correlations between RIA and MS-based assays were good (0.79 with Mayo; 0.67 with Quest) (5). Not surprisingly, inter-assay variation was highest for T levels ≤ 50 ng/dl.
Our clinical research studies of adolescent girls typically involve numerous T measurements per subject, and the routine use of LC-MS/MS has been cost-prohibitive; therefore, our group has largely relied on the direct T immunoassays offered by the LAAC-CRR. Although some publications have addressed total T RIA performance vis-à-vis MS-based assays in samples obtained from adolescent girls (6–8), none involved the LAAC-CRR’s RIA assay. Moreover, most steroid RIA kits used by the LAAC-CRR were discontinued by the manufacturer in 2014; and the LAAC-CRR has employed a chemiluminescence immunoassay (CLIA) for total T measurements since April 2016.
In recent years, the research protocols designed by two of the authors (CRM, CMBS) have involved sending selected serum samples to the Mayo Clinic for T by LC-MS/MS assay, primarily to evaluate the reliability of the LAAC-CRR’s immunoassay results. Herein we present a formal comparison of two direct immunoassays (RIA and CLIA) with LC-MS/MS in adolescent girls with T values < 50 ng/dl. The analyses support the notion that, when performed rigorously and carefully validated, direct T immunoassays are fit for use in high-quality reproductive research.
Experimental
Subjects
The analysis of RIA involved 31 paired samples obtained from 24 girls with the following characteristics (mean ± SD [median, range]): age 13.0 ± 2.4 (13.6, 8.4–16.3) years; bone age 14.4 ± 2.6 (15.3, 7.5–18) years; Tanner breast stage 4.0 ± 1.6 (5, 1–5); BMI z-score 1.84 ± 0.92 (2.07, −0.54–2.82); body fat percentage 36.6 ± 9.5 (38.0, 16.7–53.5) percent; SHBG 27.4 ± 30.4 (16.5, 5–141) nmol/liter. Sixteen subjects were postmenarcheal, with time since menarche being 1.7 ± 1.0 (2, <0.5–4) years in these girls. Of the 8 premenarcheal subjects, one had evidence for hirsutism (age 14.6 years, Tanner stage 5). Of the 16 postmenarcheal girls, 3 subjects had isolated irregular menses (one of whom was < 2 years postmenarcheal); 4 had isolated hirsutism; 3 had both irregular menses and hirsutism; and 6 had neither. Self-reported race/ethnicity was Black/African American (n = 7), Black/African American + White (n = 1), Hispanic (n = 3), and White (n = 13).
The analysis of CLIA involved 35 paired samples from 12 girls with the following characteristics: age 14.2 ± 1.9 (14.1, 12.1–18.1) years; bone age 14.8 ± 1.1 (15.0, 13.5–17) years; Tanner breast stage 4.0 ± 0.8 (4, 3–5); BMI z-score 1.36 ± 0.95 (1.27, −0.64–2.62); body fat percentage 30.7 ± 5.4 (30.8, 21.3–38.6) percent; SHBG 34.4 ± 24.0 (28.3, 8–79) nmol/liter. All of these subjects were postmenarcheal, with time since menarche being 2.7 ± 2.2 (2, 0.5–8) years. Two subjects had isolated irregular menses (both were 2 years postmenarcheal); one had isolated hirsutism; 3 had both irregular menses and hirsutism (all 2 years postmenarcheal); and 6 had neither. Self-reported race/ethnicity was Asian (n = 1), Black/African American (n = 1), Hispanic (n = 1), and White (n = 9).
Study Procedures
Samples were obtained from adolescent girls participating in six different study protocols identified below by Clinical Trials registration number. Each study was approved by the Institutional Review Board at the University of Virginia. Prior to participation, all study volunteers and their custodial parents gave written informed assent and consent, respectively.
: This is a study of adolescent girls designed to assess the relative roles of insulin resistance, hyperinsulinemia, and luteinizing hormone (LH) in adolescent obesity-associated hyperandrogenemia. For the current analysis, blood was taken at 0700, 0730, 0800, and 0830 h; these samples were pooled and submitted for T assay by LC-MS/MS and either RIA (n = 20) or CLIA (n = 1).
: This is a randomized, placebo-controlled crossover study to assess whether progesterone suppresses waking LH pulse frequency more so than sleep-related LH pulse frequency (9). For the current analysis, blood was taken at 1800, 0000, 0600, and 1200 h; these samples were pooled and submitted for T assay by LC-MS/MS and either RIA (n = 14) or CLIA (n = 15).
: This is a study of early pubertal girls with and without hyperandrogenemia to assess LH pulse secretion during waking and sleeping hours. For the current analysis, blood was taken at 1800, 0000, 0600, and 1200 h; these samples were pooled and submitted for T assay by LC-MS/MS and RIA (n = 2).
: This is a randomized, placebo-controlled crossover study to assess whether spironolactone influences progesterone suppression of waking vs. sleep-related LH pulse frequency. For the current analysis, blood was taken at 0700 h and submitted for T assay by LC-MS/MS and CLIA (n = 3).
: This study was designed to evaluate whether metformin improves adrenal and ovarian androgen responses to ACTH given at 0700 h and r-hCG given at 0800 h. For the current analysis, blood was taken at 4 different time points: prior to study procedures; before ACTH administration; 60 minutes after ACTH administration; and 24 hours after r-hCG administration. Such samples were obtained at baseline and again after 3 months of metformin (500–1000 mg BID). Available samples were submitted for T assay by LC-MS/MS and CLIA (n = 13).
: This study was designed to evaluate whether spironolactone improves adrenal and ovarian androgen responses to ACTH given at 0700 h and r-hCG given at 0800 h. For the current analysis, blood was taken at 4 different time points: prior to study procedures; before ACTH administration; 60 minutes after ACTH administration; and 24 hours after r-hCG administration. Such samples were obtained at baseline and again after 3 months of spironolactone (50–100 mg BID). Available samples were submitted for T assay by LC-MS/MS and CLIA (n = 7).
Assays
Immunoassays were validated using a protocol based on the recommendations of the Endocrine Society’s Sex Steroid Assays Reporting Task Force (10). Evaluations included the following indices: accuracy, linearity, functional sensitivity, precision, and correlation to a previous or established method. Assay characteristics for total testosterone by RIA (Siemens Healthcare, Tarrytown, NY; kit catalog reference TKTT1) were as follows: sensitivity, 10 ng/dl; intra-assay coefficient of variation (CV), 3.5–7.4%; inter-assay CV, 5.8–9.6%. Assay characteristics for total testosterone by chemiluminescence immunoassay (Immulite 2000; Siemens Healthcare, Tarrytown, NY; kit catalog reference L2KTW2) were as follows: sensitivity, 10 ng/dl; intra-assay CV, 4.4%; inter-assay CV, 7.5%. Testosterone by LC-MS/MS was performed by the Mayo Clinic Laboratory through normal send-out mechanisms via the University of Virginia Clinical Laboratory: sensitivity, 7 ng/dl; intra-assay CV, ≤ 6.0%; inter-assay CVs, 7.9–15.8% at 12.0–48.6 ng/dl. The RIA requires 100 µl serum for duplicate T measurements; the CLIA requires 40 µl serum for duplicate T measurements; and the LC-MS/MS assay requires 215 µl serum for singlet measurements.
Sample processing procedures were slightly different for immunoassay and LC-MS/MS (11). For immunoassays, blood was withdrawn via an indwelling intravenous catheter into serum separator tubes. Blood was allowed to clot at room temperature for 30 minutes prior to centrifugation (3000 rpm for 10 minutes). Serum was immediately removed and stored at –20 degrees Celsius prior to analysis, which occurred within several days for studies and . Studies , , , and involved two CRU admissions separated by 2–3 months; in these studies, half of the samples were stored for several days prior to analysis, while the other half were stored for several months prior to analysis. Samples for LC-MS/MS assay were prepared in accordance with the instructions provided by the Mayo Clinic Laboratory. Specifically, blood was withdrawn into plain red-top tubes, centrifuged (3000 rpm for 10 minutes), and refrigerated (3–6 degrees Celsius). At the end of the CRU admission, specimens were pooled as indicated and then immediately sent to the Mayo Clinic Laboratory for analysis.
All samples were assayed during the normal course of each study; and at the time of assay, neither the LAAC-CRR (including author DJH, Director of the LAAC-CRR) nor the Mayo Clinic Laboratory was aware that the results would be used for formal comparisons of assay methods. Immunoassays were performed in duplicate and results averaged, as per our clinical research group’s usual practice. Given budgetary constraints, T by LC-MS/MS was performed in singlet only. Samples with measured values below assay sensitivity were assigned the value of assay sensitivity: 10 ng/dl for immunoassay and 7 ng/dl for LC-MS/MS. To convert conventional to SI units: total testosterone (ng/dl) × 0.03467 = total testosterone (nmol/liter).
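The data-handling rules in this paragraph can be illustrated with a minimal Python sketch. The function names are hypothetical. Two further assumptions are not stated in the text: duplicates are averaged before the sensitivity floor is applied, and the unit conversion is derived from a molar mass for testosterone of about 288.4 g/mol.

```python
from statistics import mean

# Assay sensitivities quoted in the text (ng/dl).
SENSITIVITY_NG_DL = {"immunoassay": 10.0, "lcmsms": 7.0}

# Assumption: molar mass of testosterone, about 288.4 g/mol (not stated in the text).
TESTOSTERONE_G_PER_MOL = 288.4


def floor_at_sensitivity(value_ng_dl, method):
    """Assign the assay sensitivity to any result measured below it."""
    return max(value_ng_dl, SENSITIVITY_NG_DL[method])


def immunoassay_result(duplicates_ng_dl):
    """Average duplicate readings, then apply the sensitivity floor.

    The order (average first, floor second) is an assumption.
    """
    return floor_at_sensitivity(mean(duplicates_ng_dl), "immunoassay")


def ng_dl_to_nmol_l(value_ng_dl):
    """Convert ng/dl to nmol/liter: 1 ng/dl is 10 ng/liter, divided by g/mol."""
    return value_ng_dl * 10.0 / TESTOSTERONE_G_PER_MOL
```

Under these assumptions, a duplicate reading of 8 and 9 ng/dl would be reported as 10 ng/dl, and 20 ng/dl corresponds to roughly 0.69 nmol/liter.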
Statistical Analysis
Intra- and inter-assay data summarization:
Two direct inter-assay comparisons were performed: (1) RIA vs. LC-MS/MS (31 data pairs), and (2) CLIA vs. LC-MS/MS (35 data pairs). For each comparison, descriptive statistics were used to summarize both (a) empirical distributions of T measurements from each assay, and (b) the empirical distribution of inter-assay measurement discrepancy (i.e., immunoassay result minus LC-MS/MS result). The coefficients of variation (CV) served as the measure of relative variability in inter-assay measurement discrepancy. Pearson product-moment (parametric) and Spearman rank-order (non-parametric) correlation coefficients served as measures of inter-assay bivariate correlation.
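As an illustration of these summary measures, the following Python sketch computes the Pearson and Spearman coefficients and a per-sample inter-assay CV. The text does not give the exact CV formula, so the definition used here (standard deviation of the two paired results divided by their mean) is an assumption, as are all function names.

```python
from math import sqrt
from statistics import mean, stdev


def pair_cv_percent(a, b):
    """CV (%) of two results for one sample: SD of the pair / mean of the pair.

    This definition is an assumption; the text does not spell it out.
    """
    return 100.0 * stdev([a, b]) / mean([a, b])


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

Tied values matter in this setting because several results are assigned the same assay sensitivity value, which is why the rank function averages ties.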
Deming regression:
Deming regression was utilized to test for systematic bias and for non-unity proportional differences between immunoassay and LC-MS/MS. In contrast to simple ordinary least-squares regression, which assumes random measurement error in the dependent y-variable only, Deming regression assumes random measurement error in both the dependent y-variable and the independent x-variable (12). This feature allows examination of two important aspects of measurement agreement, systematic bias and non-unity proportional differences. The test for systematic bias is contingent on the Deming regression intercept parameter. If the 95% confidence interval for the regression intercept parameter excludes the value 0, the null hypothesis (i.e., that there is no systematic bias between the two sets of measurements) is rejected. The test for non-unity proportional difference between the two measurements is contingent on the Deming regression slope parameter. If the 95% confidence interval for the regression slope parameter excludes the value 1, the null hypothesis (i.e., unity proportional difference between the two sets of measurements) is rejected.
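A minimal Python sketch of this procedure follows. It is not the SAS implementation used in this study. It assumes an error-variance ratio of 1 (`delta`), uses a leave-one-out jackknife for the confidence intervals, and takes the critical value as a parameter (1.96 is a normal approximation; a t quantile would be more usual for small samples). All names are hypothetical.

```python
from math import sqrt
from statistics import mean


def deming_fit(x, y, delta=1.0):
    """Return (intercept, slope) of a Deming regression of y on x.

    x and y are lists. delta is the assumed ratio of the error variance
    in y to the error variance in x. Fails if the covariance is zero.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    d = syy - delta * sxx
    slope = (d + sqrt(d ** 2 + 4.0 * delta * sxy ** 2)) / (2.0 * sxy)
    return my - slope * mx, slope


def jackknife_ci(x, y, crit=1.96, delta=1.0):
    """Jackknife intervals: [(intercept_low, intercept_high), (slope_low, slope_high)]."""
    n = len(x)
    full = deming_fit(x, y, delta)
    loo = [deming_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:], delta) for i in range(n)]
    intervals = []
    for k in range(2):
        pseudo = [n * full[k] - (n - 1) * est[k] for est in loo]
        centre = mean(pseudo)
        se = sqrt(sum((p - centre) ** 2 for p in pseudo) / (n * (n - 1)))
        intervals.append((centre - crit * se, centre + crit * se))
    return intervals


def excludes(interval, value):
    """True if the interval excludes value, i.e. the null hypothesis is rejected."""
    low, high = interval
    return value < low or value > high
```

The decision rule is then `excludes(intercept_ci, 0.0)` for systematic bias and `excludes(slope_ci, 1.0)` for a non-unity proportional difference.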
Bland-Altman analyses:
Bland-Altman analyses were conducted to further assess measurement agreement between immunoassay (RIA or CLIA) and the reference LC-MS/MS assay. This method involves not only estimating the marginal (i.e., average) inter-method measurement discrepancy, but also estimating the lower and upper limits of agreement for a single inter-method bivariate observation—i.e., the range of inter-method measurement discrepancies that would be expected within approximately ± 2 standard deviations of the marginal discrepancy. Bland-Altman analysis is often summarized in the form of a scatterplot where the paired differences (e.g., difference between immunoassay and LC-MS/MS) are plotted on the y-axis and the reference method values (e.g., LC-MS/MS) are plotted on the x-axis. The marginal discrepancy is identified by one horizontal line, while the lower and upper 95% confidence limits for the marginal discrepancy are identified by two additional horizontal lines. The lower and upper limits of measurement agreement for a single inter-method bivariate observation are identified by two additional horizontal lines. This graphical representation permits a visual assessment of the degree to which the results of two methods of measurement agree and how agreement may vary across the range of measured values (13, 14).
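The quantities plotted in a Bland-Altman summary can be sketched in Python as follows. The multiplier `k` defaults to 2, matching the "approximately ± 2 standard deviations" description above. The exact multiplier used in this study is not stated, so this is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def bland_altman(immunoassay, reference, k=2.0):
    """Summarize agreement between paired immunoassay and reference results."""
    diffs = [a - b for a, b in zip(immunoassay, reference)]
    bias = mean(diffs)                 # marginal (average) discrepancy
    sd = stdev(diffs)                  # SD of the paired differences
    sem = sd / sqrt(len(diffs))        # standard error of the mean discrepancy
    return {
        "bias": bias,
        "sd": sd,
        "ci_mean": (bias - k * sem, bias + k * sem),
        "limits": (bias - k * sd, bias + k * sd),
    }
```

With the RIA summary values from Table 1 (mean 0.84, SD 4.70 ng/dl), a multiplier of exactly 2 gives limits of about −8.56 and 10.24 ng/dl. The reported limits (−8.76 and 10.43 ng/dl) are slightly wider, consistent with a multiplier a little above 2.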
Assessments of between-group vs. within-group variability:
A subset of subjects contributed multiple assay measurements to these analyses. We addressed this by assessing the inter-assay measurement discrepancy intra-class correlation coefficient (ICC). The ICC is a function of the between-group variability and the within-group variability, and it identifies how strongly units in the same group resemble each other. The ICC has a range of −1 to 1, and values close to zero indicate that the inter-assay measurement discrepancy values from the same individual are no more alike than two inter-assay measurement discrepancy values from two unique individuals.
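One common way to estimate such an ICC is the one-way random-effects ANOVA estimator, sketched below in Python for subjects with unequal numbers of samples. Whether this study used this particular estimator is not stated, so it is an assumption, and the function name is hypothetical.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for groups of unequal size.

    groups: list of lists, one inner list of discrepancy values per subject.
    Requires at least two subjects and more values than subjects.
    """
    g = len(groups)
    sizes = [len(v) for v in groups]
    n_total = sum(sizes)
    grand = sum(sum(v) for v in groups) / n_total
    means = [sum(v) / len(v) for v in groups]
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for v, m in zip(groups, means) for x in v)
    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)
    # Effective group size when subjects contribute unequal numbers of samples.
    k0 = (n_total - sum(n * n for n in sizes) / n_total) / (g - 1)
    denom = ms_between + (k0 - 1.0) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all values are identical")
    return (ms_between - ms_within) / denom
```

The estimate becomes negative when values from the same subject differ more than values from different subjects, which is how a slightly negative ICC can arise.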
All statistical testing was performed using SAS version 9.4 (SAS Institute Inc., Cary, NC). Unless otherwise stated, data are reported as mean (95% confidence interval [CI] for the mean).
Results
RIA vs. LC-MS/MS
Table 1 includes summary statistics for T results by each method, the absolute difference between methods, and inter-assay CVs. The mean difference between RIA and LC-MS/MS was 0.84 ± 4.70 ng/dl (mean ± SD). The mean inter-assay CV was 14.6 ± 13.0 (95% CI: [9.9, 19.4]) percent. The correlation between RIA and LC-MS/MS results was very strong in both parametric and non-parametric analyses (r = 0.899 and 0.876, respectively; p < 0.0001 for both). In four samples, T was undetectable by RIA only; in two samples, T was undetectable by LC-MS/MS only; in four samples, T was undetectable by both RIA and LC-MS/MS (Figure 1A).
Table 1. Summary statistics.
Abbreviations: CLIA, chemiluminescence immunoassay; CL, confidence limit; CV, coefficient of variation; LC-MS/MS, liquid chromatography-tandem mass spectrometry; RIA, radioimmunoassay; T, testosterone.
| | RIA vs. LC-MS/MS (n = 31 pairs) | | | | CLIA vs. LC-MS/MS (n = 35 pairs) | | | |
|---|---|---|---|---|---|---|---|---|
| | RIA T (ng/dl) | LC-MS/MS T (ng/dl) | Absolute difference* (ng/dl) | Inter-assay CV (%) | CLIA T (ng/dl) | LC-MS/MS T (ng/dl) | Absolute difference* (ng/dl) | Inter-assay CV (%) |
| Mean | 20.8 | 20.0 | 0.84 | 14.6 | 21.4 | 20.0 | 1.39 | 18.0 |
| Standard deviation | 10.6 | 10.3 | 4.70 | 13.0 | 10.9 | 8.6 | 6.45 | 12.7 |
| Standard error of the mean | 1.9 | 1.8 | 0.84 | 2.3 | 1.8 | 1.5 | 1.09 | 2.2 |
| Lower 95% CL for mean | 16.9 | 16.2 | −0.89 | 9.9 | 17.7 | 17.1 | −0.83 | 13.7 |
| Upper 95% CL for mean | 24.7 | 23.7 | 2.56 | 19.4 | 25.2 | 23.0 | 3.60 | 22.4 |
| Minimum | 10.0 | 7.0 | −9.9 | 0.0 | 10.0 | 7.5 | −9.70 | 1.4 |
| 25th percentile | 10.2 | 10.5 | −1.7 | 2.9 | 12.5 | 13.5 | −3.20 | 8.2 |
| Median | 16.6 | 21.0 | 0.0 | 12.2 | 17.6 | 19.0 | 1.15 | 15.1 |
| 75th percentile | 28.7 | 27.5 | 3.0 | 24.9 | 29.0 | 24.0 | 5.80 | 23.5 |
| Maximum | 47.1 | 43.0 | 11.1 | 46.5 | 44.4 | 44.0 | 17.40 | 46.9 |
*Absolute difference = immunoassay result minus LC-MS/MS result. To convert conventional units to SI units: total testosterone (ng/dl) × 0.03467 = total testosterone (nmol/liter).
Figure 1. Deming regression and Bland-Altman plots, RIA vs. LC-MS/MS.
Panel A: Deming regression summary for the regression of RIA results against paired LC-MS/MS results. The thick solid line identifies the regression line of best fit, while the shaded area identifies the 95% confidence interval for predicting mean RIA T as a function of LC-MS/MS T. The dashed lines identify the lower and upper bounds of the 95% prediction interval for predicting RIA T as a function of LC-MS/MS T. The thin dotted line identifies the line of identity (i.e., RIA = LC-MS/MS). The inset shows how relevant data points relate to assay limits of detection; note that 4 samples were undetectable by both methods. Panel B: Bland-Altman plot for the measurement discrepancy between the RIA and the LC-MS/MS assay reference. The solid line identifies the mean measurement discrepancy, and the shaded area identifies the lower and upper 95% confidence limits for mean measurement discrepancy. Dashed lines identify the Bland-Altman lower and upper limits of agreement, which approximate the mean measurement discrepancy ± 2 standard deviations. To convert conventional to SI units: total T (ng/dl) × 0.03467 = total T (nmol/liter).
By Deming regression (Figure 1A), the estimate for intercept was 0.16 (95% CI: [−3.25, 3.67]) ng/dl, and the estimate for slope was 1.03 (95% CI: [0.88, 1.19]). Since these estimates were not significantly different from 0 and 1, respectively, this analysis provides no evidence for systematic bias or for a non-unity proportional difference between assays.
The Bland-Altman plot is shown in Figure 1B. Mean measurement discrepancy between RIA and LC-MS/MS was 0.84 (95% CI: [−0.89, 2.56]) ng/dl (p = 0.329), providing no evidence for systematic bias between methods. Bland-Altman lower and upper limits of agreement were −8.76 and 10.43 ng/dl, respectively. Utilizing the data in Table 1, we calculated that it would require a total of 244 independent paired RIA and LC-MS/MS measurements to demonstrate a statistically significant 0.84-unit systematic bias between assays, assuming the need for 80% statistical power and a two-sided type I error rate of 0.05.
For the subsample of six subjects who contributed multiple assay measurements, the inter-assay measurement discrepancy intra-class correlation coefficient (ICC) was −0.08. This indicates that the inter-assay measurement discrepancy values from the same individual are essentially no more alike than two inter-assay measurement discrepancy values from two unique individuals.
CLIA vs. LC-MS/MS
Table 1 includes summary statistics for T results by each method, the absolute difference between methods, and inter-assay CVs. The mean difference between CLIA and LC-MS/MS was 1.39 ± 6.45 ng/dl (mean ± SD). The mean inter-assay CV was 18.0 ± 12.7 (95% CI: [13.7, 22.4]) percent. The correlation between methods was strong in both parametric and non-parametric analyses (r = 0.806 and 0.791, respectively; p < 0.0001 for both). In two samples, T was undetectable by CLIA; but T was detectable by LC-MS/MS in all samples (Figure 2A).
Figure 2. Deming regression and Bland-Altman plots, CLIA vs. LC-MS/MS.
Panel A: Deming regression summary for the regression of CLIA results against paired LC-MS/MS results. The regression line of best fit, the 95% confidence interval, the 95% prediction interval, and the line of identity are indicated as in Figure 1A. The inset shows how relevant data points relate to assay limits of detection; two overlapping data points are identified. Panel B: Bland-Altman plot for the measurement discrepancy between the CLIA and the LC-MS/MS assay reference. The mean measurement discrepancy, the 95% confidence limits for mean measurement discrepancy, and the Bland-Altman lower and upper limits of agreement are shown as in Figure 1B. To convert conventional to SI units: total T (ng/dl) × 0.03467 = total T (nmol/liter).
By Deming regression (Figure 2A), the estimate for intercept was −5.54 (95% CI: [−11.55, 0.47]) ng/dl, and the estimate for slope was 1.35 (95% CI: [1.06, 1.62]). Since the estimate for intercept was not significantly different from 0, this analysis provides no evidence for systematic bias between assay measurements. However, the estimate for slope was significantly different from 1, suggesting that the proportional difference between the two assays is not equal to 1. Visual inspection of the plot suggests that, as measured values increased within this range, values obtained by CLIA tended to be progressively, albeit only modestly (i.e., by < 10 ng/dl), higher than those obtained by LC-MS/MS.
The Bland-Altman plot is shown in Figure 2B. Mean measurement discrepancy between CLIA and LC-MS/MS was 1.39 (95% CI: [−0.83, 3.60]) ng/dl, providing no evidence for systematic bias between methods. Bland-Altman lower and upper limits of agreement were −11.71 and 14.49 ng/dl, respectively. Utilizing the data in Table 1, we calculated that it would require a total of 156 independent paired CLIA and LC-MS/MS measurements to demonstrate a statistically significant 1.39-unit systematic bias between assays, assuming the need for 80% statistical power and a two-sided type I error rate of 0.05.
For the subsample of seven subjects who contributed multiple assay measurements, the inter-assay measurement discrepancy ICC was 0.22. This indicates that the inter-assay measurement discrepancy values from the same individual are modestly more alike than two inter-assay measurement discrepancy values obtained from two unique individuals.
Comparison of RIA vs. CLIA
A graphical representation of the distribution of differences between RIA and LC-MS/MS is shown in Figure 3A, and between CLIA and LC-MS/MS in Figure 3B. Figure 3C shows Bland-Altman measurement agreement summaries (against LC-MS/MS) for the two T immunoassays. As described above, mean measurement discrepancy against LC-MS/MS appeared to be slightly higher with CLIA (1.39 ng/dl) than with RIA (0.84 ng/dl), but this difference (RIA minus CLIA: −0.55 ng/dl; 95% CI: [−3.31, 2.20]) was not statistically significant (p = 0.691). The standard deviation for measurement discrepancy against LC-MS/MS (a measure of variability in measurement discrepancy) appeared to be higher with CLIA (6.44; 95% CI: [5.21, 8.45]) than with RIA (4.70; 95% CI: [3.75, 6.28]), but this difference was not statistically significant (p = 0.100). Thus, this analysis does not provide evidence for RIA vs. CLIA differences in measurement agreement with LC-MS/MS. However, firm conclusions may be limited by relatively low statistical power.
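The comparison of mean discrepancies can be approximated from the Table 1 summary statistics alone. The Python sketch below uses an unequal-variance (Welch-type) standard error and a critical value of 2.0 in place of the exact t quantile. Both choices are assumptions, since the test used for this comparison is not named, and the function name is hypothetical.

```python
from math import sqrt


def welch_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b, crit=2.0):
    """Difference in means (a minus b), its standard error, and an approximate CI."""
    diff = mean_a - mean_b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return diff, se, (diff - crit * se, diff + crit * se)


# Table 1 values (ng/dl): discrepancy vs. LC-MS/MS for RIA (a) and CLIA (b).
diff, se, ci = welch_difference(0.84, 4.70, 31, 1.39, 6.45, 35)
```

With these rounded inputs the sketch gives a difference of −0.55 ng/dl, a standard error of about 1.38 ng/dl, and an interval of roughly −3.31 to 2.21 ng/dl, close to the interval reported above.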
Figure 3. Measurement agreement with LC-MS/MS: RIA vs. CLIA.

Panel A: Distribution of differences between paired RIA and LC-MS/MS T results. The distribution of these differences did not deviate significantly from normal (Shapiro-Wilk test, p = 0.340). Panel B: Distribution of differences between paired CLIA and LC-MS/MS T results. The distribution of these differences did not deviate significantly from normal (Shapiro-Wilk test, p = 0.7426). Panel C: This graph summarizes Bland-Altman measurement agreement against LC-MS/MS for both RIA and CLIA. Solid circles identify the mean measurement discrepancy, and the vertical lines represent the lower and upper limits of measurement agreement. The dashed horizontal lines identify the lower 95% confidence bounds for the lower limit of measurement agreement and the upper 95% confidence bounds for the upper limit of measurement agreement. To convert conventional to SI units: total T (ng/dl) × 0.03467 = total T (nmol/liter).
Discussion
The current analysis suggests that the LAAC-CRR’s total T immunoassays, employed for clinical research studies involving multiple samples in adolescent girls, perform well against an MS-based assay in a large and well-respected clinical laboratory (Mayo Clinic Laboratory). In particular, this analysis disclosed excellent concordance between the direct T RIA (Siemens) and LC-MS/MS in adolescent girls. These results confirm earlier findings in adult men and in women with PCOS (5, 15); and they are in keeping with a study of a different RIA assay vs. LC-MS/MS in adolescent girls (8). The current analysis also suggests that the direct T CLIA (Immulite 2000; Siemens Healthcare) compares well to an LC-MS/MS assay, a conclusion that contrasts with previous reports involving LC-MS/MS (15) or isotope-dilution gas chromatography-MS (6). The reason(s) for the discordance between our analysis and earlier reports is (are) unclear. Nonetheless, we believe that this direct comparison study of the LAAC-CRR’s CLIA T assay provides reassurance as we and other NIH-supported researchers strive to maximize scientific rigor in times of limited resources.
Average differences between paired T measurements were small and not significantly different from zero: 0.84 ± 4.70 ng/dl (mean ± SD) between RIA and LC-MS/MS, and 1.39 ± 6.45 ng/dl between CLIA and LC-MS/MS. While percent differences suggest higher discordance, this in part reflects low T concentrations per se (i.e., a low number as denominator). For example, at an LC-MS/MS T concentration of 20 ng/dl, a difference of only 5 ng/dl between assays represents a 25% difference. Similarly, undetectable concentrations in both assays rendered a 43% difference between assays because values were assigned the respective assays’ sensitivity (10 vs. 7 ng/dl). Moreover, relatively high inter-assay variability is often observed between different MS-based assays. For example, the analysis of Legro et al. suggested that, for T values ≤ 50 ng/dl, the inter-assay CV between two LC-MS/MS assays (Mayo Clinic Laboratory and Quest Diagnostics) was greater than the inter-assay CV between RIA and the Mayo LC-MS/MS (26.2 [95% CI: 23.5, 28.8] vs. 20.6 [95% CI: 18.7, 22.5] percent, respectively) (5). Of interest in this regard, the inter-assay CVs between RIA and LC-MS/MS (14.6 [95% CI: 9.9, 19.4] percent) and between CLIA and LC-MS/MS (18.0 [95% CI: 13.7, 22.4] percent) in the current study appeared to be lower than the inter-assay CV between the two LC-MS/MS assays in the aforementioned study (5). Similarly, in one study comparing seven MS-based assays to a reference MS-based assay performed by the National Institute of Standards and Technology, inter-assay CV ranged from approximately 13% to 33% for T values < 50 ng/dl (16).
The current study has both strengths and weaknesses. Importantly, this comparison involved “real-world” assay results: all assays were performed during the usual course of study; and at the time of assay performance, laboratories were unaware that this analysis would be performed. A weakness of the current study is the relatively small number of paired data available for analysis. However, small sample sizes are a common limitation in studies involving children and adolescents. Additionally, immunoassays in this analysis were performed in duplicate with the results averaged (which is expected to improve the reliability of results), while LC-MS/MS assays were performed in singlet. This reflects our current research practice, and duplicate immunoassay remains less expensive and requires less serum than singlet LC-MS/MS assay.
In an ideal world, all reproductive endocrinology researchers would employ sex steroid assays with the best available combination of functional sensitivity, specificity, precision, and accuracy. While this goal may be best realized with standardized, well-validated, and well-performed MS-based assays, accessibility to such assays remains a significant challenge. MS-based assay costs are comparatively high, and in the context of significant funding limitations, the routine use of MS-based assays may be impractical for many researchers. The serum volume requirement may also be an important consideration. For example, our physiological studies in adolescents typically involve extended periods of frequent sampling for LH in addition to multiple samples for sex steroids including T. Given strict blood withdrawal limits in children, any technique requiring additional blood withdrawal may be associated with important opportunity costs.
In conclusion, the current analyses suggest that direct total T immunoassay results from an NICHD-supported research laboratory compare favorably with an LC-MS/MS reference from a large and well-respected clinical laboratory. Until standardized, high-quality MS-based assays are ubiquitously accessible to researchers, well-validated and well-performed direct sex steroid immunoassays will continue to play an important role in high-quality reproductive endocrinology research.
Supplementary Material
Highlights.
- The appropriate role of direct total testosterone (T) immunoassays is controversial
- In adolescent girls (n = 31), correlation between total T RIA and LC-MS/MS: r = 0.899; p < 0.0001
- Deming regression/Bland-Altman analysis suggested no bias between RIA and LC-MS/MS
- In adolescent girls (n = 35), correlation between total T chemiluminescence and LC-MS/MS: r = 0.806; p < 0.0001
- Bland-Altman analysis suggested no bias between chemiluminescence and LC-MS/MS
- By Deming regression, chemiluminescence tended to be higher than LC-MS/MS at higher values
Acknowledgements:
We gratefully acknowledge Anne Gabel, Katherine Ehrlich, Deborah M. Sanderson, and Melissa Gilrain for subject recruitment, study scheduling, and assistance with data management. We also extend our gratitude to the nurses and staff of the Clinical Research Unit at the University of Virginia for implementation of the sampling protocols, and to the Center for Research in Reproduction Ligand Assay and Analysis Core Laboratory for performance of assays.
Funding: This work was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development/National Institutes of Health (NIH) through Cooperative Agreement P50 HD28934 as part of the National Centers for Translational Research in Reproduction and Infertility (CRM, CMBS, JTP, JCM, DJH); NIH R01 HD058671 (CRM, JTP); and NIH K23 HD070854 (CMBS).
Abbreviations:
- CI: confidence interval
- CLIA: chemiluminescence immunoassay
- CV: coefficient of variation
- FIB: Fertility and Infertility Branch
- LAAC-CRR: Ligand Assay and Analysis Core of the Center for Research in Reproduction
- LC-MS/MS: liquid chromatography-tandem mass spectrometry
- MS: mass spectrometry
- NICHD: Eunice Kennedy Shriver National Institute of Child Health and Human Development
- PCOS: polycystic ovary syndrome
- RIA: radioimmunoassay
- RMN: Reproductive Medicine Network
- T: testosterone
Footnotes
Declarations of interest: None
References
- 1. Rosner W, Auchus RJ, Azziz R, Sluss PM, Raff H. Position statement: Utility, limitations, and pitfalls in measuring testosterone: an Endocrine Society position statement. J Clin Endocrinol Metab 2007;92:405–413.
- 2. Handelsman DJ. Mass spectrometry, immunoassay and valid steroid measurements in reproductive medicine and science. Hum Reprod 2017;32:1147–1150.
- 3. Handelsman DJ, Wartofsky L. Requirement for mass spectrometry sex steroid assays in the Journal of Clinical Endocrinology and Metabolism. J Clin Endocrinol Metab 2013;98:3971–3973.
- 4. Auchus RJ. Steroid assays and endocrinology: best practices for basic scientists. Endocrinology 2014;155:2049–2051.
- 5. Legro RS, Schlaff WD, Diamond MP, Coutifaris C, Casson PR, Brzyski RG, Christman GM, Trussell JC, Krawetz SA, Snyder PJ, Ohl D, Carson SA, Steinkampf MP, Carr BR, McGovern PG, Cataldo NA, Gosman GG, Nestler JE, Myers ER, Santoro N, Eisenberg E, Zhang M, Zhang H. Total testosterone assays in women with polycystic ovary syndrome: precision and correlation with hirsutism. J Clin Endocrinol Metab 2010;95:5305–5313.
- 6. Taieb J, Mathian B, Millot F, Patricot MC, Mathieu E, Queyrel N, Lacroix I, Somma-Delpero C, Boudou P. Testosterone measured by 10 immunoassays and by isotope-dilution gas chromatography-mass spectrometry in sera from 116 men, women, and children. Clin Chem 2003;49:1381–1395.
- 7. Mouritsen A, Soeborg T, Johannsen TH, Aksglaede L, Sorensen K, Hagen CP, Mieritz MG, Frederiksen H, Andersson AM, Juul A. Longitudinal changes in circulating testosterone levels determined by LC-MS/MS and by a commercially available radioimmunoassay in healthy girls and boys during the pubertal transition. Horm Res Paediatr 2014;82:12–17.
- 8. Ankarberg-Lindgren C, Norjavaara E. Sensitive RIA measures testosterone concentrations in prepubertal and pubertal children comparable to tandem mass spectrometry. Scand J Clin Lab Invest 2015;75:341–344.
- 9. Kim SH, Lundgren JA, Bhabhra R, Collins JS, Patrie JT, Burt Solorzano CM, Marshall JC, McCartney CR. Progesterone-mediated inhibition of the GnRH pulse generator: differential sensitivity as a function of sleep status. J Clin Endocrinol Metab 2017. [Epub ahead of print Dec 28, 2017; DOI: 10.1210/jc.2017-02299].
- 10. Wierman ME, Auchus RJ, Haisenleder DJ, Hall JE, Handelsman D, Hankinson S, Rosner W, Singh RJ, Sluss PM, Stanczyk FZ. Editorial: The new instructions to authors for the reporting of steroid hormone measurements. J Clin Endocrinol Metab 2014;99:4375.
- 11. Raff H, Sluss PM. Pre-analytical issues for testosterone and estradiol assays. Steroids 2008;73:1297–1304.
- 12. Cornbleet PJ, Gochman N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin Chem 1979;25:432–438.
- 13. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–310.
- 14. Giavarina D. Understanding Bland Altman analysis. Biochem Med (Zagreb) 2015;25:141–151.
- 15. Wang C, Catlin DH, Demers LM, Starcevic B, Swerdloff RS. Measurement of total serum testosterone in adult men: comparison of current laboratory methods versus liquid chromatography-tandem mass spectrometry. J Clin Endocrinol Metab 2004;89:534–543.
- 16. Vesper HW, Bhasin S, Wang C, Tai SS, Dodge LA, Singh RJ, Nelson J, Ohorodnik S, Clarke NJ, Salameh WA, Parker CR Jr., Razdan R, Monsell EA, Myers GL. Interlaboratory comparison study of serum total testosterone [corrected] measurements performed by mass spectrometry methods. Steroids 2009;74:498–503.