Abstract
Bead array assays, such as those sold by Luminex, BD Biosciences, Sartorius, Abcam and other companies, are a well-established platform for multiplexed quantification of cytokines and other biomarkers in both clinical and discovery research environments. In 2011, the National Institute of Allergy and Infectious Diseases (NIAID)-funded External Quality Assurance Program Oversight Laboratory (EQAPOL) established a proficiency assessment program to monitor participating laboratories performing multiplex cytokine measurements using Luminex bead array technology. During every assessment cycle, each site was sent an assay kit, a protocol, and blinded samples of human sera spiked with recombinant cytokines. Site results were then evaluated for performance relative to peer laboratories. After over a decade of biannual assessments, the cumulative dataset contained over 15,500 bead array observations collected at more than forty laboratories in twelve countries. These data were evaluated alongside post-assessment survey results to empirically test factors that may contribute to variability and accuracy in Luminex bead-based cytokine assays. Bead material, individual technical ability, analyte, analyte concentration, and assay kit vendor were identified as significant contributors to assay performance. In contrast, the bead reader instrument model and the use of automated plate washers were found not to contribute to variability or accuracy, and sample results were found to be highly consistent between assay kit manufacturing lots and over time. In addition to these statistical analyses, subjective evaluations identified technical ability, instrument failure, protocol adherence, and data transcription errors as the most common causes of poor performance in the proficiency program. The findings from the EQAPOL multiplex program were then used to develop recommended best practices for bead array monitoring of human cytokines.
These included collecting samples to assay as a single batch, centralizing analysis, participating in a quality assurance program, and testing samples using paramagnetic-bead kits from a single manufacturer using a standardized protocol.
Keywords: Luminex, Proficiency Assessment, Multiplex Cytokine Assays
1. Introduction
Bead array immunoassays enable concurrent quantification of multiple analytes in a small sample volume via fluorescently distinct capture beads such as those sold by BD Biosciences, Sartorius, Abcam, Luminex and other corporations (Faresjo, 2014). Commercially manufactured or laboratory-developed bead arrays can be used in combination with dedicated bead readers and software to identify and interrogate up to 500 distinct bead sets from a single well of a 96- or 384-well plate. Multiplex bead array assays are frequently used to quantify cytokines and other biomarkers in many disciplines.
Several rigorous comparative studies over the past twenty years have evaluated multiplex bead array assay performance in single laboratories alongside other multiplex assay technologies and traditional single-plex ELISAs (e.g. Lash et al., 2006; Djoba Siawaya et al., 2008; Chowdhury et al., 2009; de Jager et al., 2009; Dossus et al., 2009; Fu et al., 2010; Butterfield et al., 2011; McKay et al., 2017; Gunther et al., 2020). However, despite the extensive adoption of bead array technology for pre-clinical and clinical research, there has been limited evaluation of the factors contributing to the inter-laboratory variability of assays run on the Luminex platform. Identifying and minimizing these sources of variance is important for optimal longitudinal and multi-site study design, assay harmonization, scientific rigor, and reproducibility.
The External Quality Assurance (EQA) Program Oversight Laboratory (EQAPOL) multiplex program was established in 2011 to provide external proficiency assessment of bead assays performed in laboratories supporting National Institute of Allergy and Infectious Diseases (NIAID)- and Cancer Immunotherapy Consortium (CIC)-supported human trials. Sites participating in the program were sent lot-matched assay kits, protocols, and human sera spiked with five recombinant cytokines at multiple concentrations. The blinded spiked samples were assayed in triplicate at each site, and the results and raw data were analyzed by the EQAPOL Oversight Laboratory (EOL). Following two unscored external proficiency (EP) rounds to optimize the study plan, assay kit selection, sample panel composition, and proficiency score criteria, formal performance grading was implemented in June 2012 as part of EP3. Two 2014 publications in this journal (Lynch et al., 2014; Rountree et al., 2014) described the founding of the program and the findings from EP1 through EP5 conducted from 2011 to 2013. The program has now completed 22 rounds of proficiency testing, with over forty independent laboratories in twelve countries having participated in at least one round.
In this follow-up to the 2014 reports, the site proficiency assessments and associated dataset of about 15,500 measurements obtained from EP6 through EP22 were used to evaluate factors that may contribute to variability in multiplexed cytokine assays using Luminex beads. Bead material, analyte, sample concentration, technical skill/experience, sample short- and long-term stability, assay manufacturer, instrument selection, and other factors were evaluated for their impacts on assay performance. These analyses were then used to inform suggested best practices for bead-array immunoassays.
2. Material and Methods
2.1. Samples, Assay Kits, Instrumentation, and Data Reporting
Proficiency testing was performed largely as described (Lynch et al., 2014; Rountree et al., 2014). Testing sample panels were created using commercial human AB heat-inactivated serum (GeminiBio) and recombinant human cytokines (IL-2, IL-6, IL-10, TNF-α, and IFN-γ, all at >97% purity and without added carrier proteins; R&D Systems). Lyophilized cytokines were reconstituted in their manufacturer-recommended carrier buffers, rested at room temperature for 15 minutes, and then spiked into 0.22 μm-filtered serum. Spiked serum was mixed for 15 minutes at room temperature and then aliquoted into single-use vials. The panels were then stored at −80°C, and each lot was used for 4 to 8 EP cycles. The homogeneity of the samples was tested in the EOL prior to each EP using a human cytokine five-plex (IL-2, IL-6, IL-10, TNF-α, and IFN-γ) Luminex bead-based assay kit (MilliporeSigma # HCYTOMAG-60K-05) performed according to the manufacturer’s protocol with a two-hour, rather than overnight, sample capture step. For each cytokine in a sample lot, the one-sided lower bound of the 95% confidence interval of the between-vial variation had to be less than or equal to 0.3 of the standard deviation of the between-vial variation obtained during the previous EP. Any lot found to be non-homogeneous was retested, and if the second result also indicated non-homogeneity then the lot was discarded.
After homogeneity testing, participating sites were provided a panel of blinded vials containing triplicate aliquots of seven test samples spiked with concentrations of cytokines that would fall in particular regions of the standard curves (low, medium-low, medium, medium-high and/or high). Sites also received a human cytokine five-plex (IL-2, IL-6, IL-10, TNF-α, and IFN-γ) assay kit (MilliporeSigma # HCYTOMAG-60K-05). Sites were instructed to test the samples using the kit according to the manufacturer’s protocol and to adhere to a two-hour incubation period for analyte capture. From EP12 onward, sites also had the option, alternating between rounds, of either i) receiving a second sample set and assay kit from the same lot to allow for technician comparison, or ii) receiving a second sample set to assay using the Luminex bead-based assay of their choice. Sites were permitted to read both the common kit and site-selected kit using any available Luminex bead reader and to analyze the data using the software and approach of their choosing. Mean Fluorescence Intensity (MFI) and site-determined pg/mL concentration for each well and analyte were transmitted to the EOL for centralized analysis. EPs were performed every six months from EP6 (October 2013) to EP18 (October 2019). No EP was performed in April 2020 as participating laboratories and the EOL were responding to COVID-19. The biannual cycle resumed in October 2020 with EP19.
2.2. Proficiency Scoring and Remediation
At each EP, the proficiency of every site was scored out of 100 using previously described approaches (Lynch et al., 2014; Rountree et al., 2014). Briefly: 10 points (all or none) were awarded for timeliness in returning data to the coordinating site. 10 points were awarded for protocol adherence: five (all or none) for site-reported adherence to instrument setup, assay procedure, plate layout, and data analysis instructions, plus a maximum of five points (one per analyte) for the acceptability of the four-parameter logistic curve fit probabilities calculated by the EOL from site-reported MFI data and expected standard concentrations. A maximum of 40 points was awarded for the accuracy of the site-reported pg/mL concentration compared to the consensus pg/mL concentration of all participating sites, evaluated using a mixed effects model with a Bonferroni correction. A maximum of 40 points was awarded for precision: observed coefficients of variation (%CV) per analyte per sample that were outside one (EP1–7) or two (EP8 onwards) standard deviations (SD) of the mean %CV for that analyte and sample resulted in point deductions. The number of unique test samples varied per EP, and so the point value deducted per analyte per sample varied to maintain a 40-point weighting.
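As an illustration of the precision rule described above, the following minimal Python sketch flags high %CVs against a peer-based cutoff. The data, function names, and the choice of k=1 (the earlier, stricter rule) are hypothetical and illustrative, not the program's actual code:

```python
import statistics

def pct_cv(replicates):
    """Percent coefficient of variation for one analyte in one sample."""
    return 100 * statistics.stdev(replicates) / statistics.mean(replicates)

def flag_imprecise(site_cvs, k):
    """Flag any site %CV more than k standard deviations above the mean
    %CV across all sites for the same analyte and sample (the program
    used k=1 in early EPs and k=2 from EP8 onwards)."""
    values = list(site_cvs.values())
    cutoff = statistics.mean(values) + k * statistics.stdev(values)
    return {site: cv > cutoff for site, cv in site_cvs.items()}

# Hypothetical replicate wells (pg/mL) for one site, analyte, and sample
cv = pct_cv([101.0, 98.5, 104.2])

# Hypothetical %CVs across five sites for the same analyte and sample
flags = flag_imprecise({"A": 4.2, "B": 5.1, "C": 3.8, "D": 30.0, "E": 4.9}, k=1)
```

With these illustrative values, only site D exceeds the peer-based cutoff and would lose precision points.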
Sites scoring under 75 points (graded as “Fair” or “Poor”) were invited to remediation discussions. Contemporaneous notes from these discussions were used to categorize likely causes of low scores. Each remediation instance was assigned to a single category.
2.3. Statistical Analysis
Mixed-effects models were used to address the various longitudinal research questions presented in this report. These models used both fixed and random effects, supported testing of longitudinal trends, and allowed estimation of the within- and between-site variance components (Fitzmaurice et al., 2004). These variance estimates were used to calculate the intraclass correlation (ICC) as ICC = σb / (σb + σw), where σb is the between-site variance and σw is the within-site variance.
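The ICC calculation can be illustrated with a one-way random-effects estimate of the variance components; this is a simplified stand-in for the full mixed-effects models used in the program, and all data and names below are hypothetical:

```python
import statistics

def icc_oneway(groups):
    """ICC = sigma_b / (sigma_b + sigma_w), estimated from a balanced
    one-way random-effects layout where `groups` maps each site to its
    replicate measurements."""
    k = len(groups)                       # number of sites
    n = len(next(iter(groups.values())))  # replicates per site
    means = {s: statistics.mean(v) for s, v in groups.items()}
    grand = statistics.mean(means.values())
    # One-way ANOVA mean squares
    ms_between = n * sum((m - grand) ** 2 for m in means.values()) / (k - 1)
    ms_within = sum(
        (x - means[s]) ** 2 for s, v in groups.items() for x in v
    ) / (k * (n - 1))
    sigma_b = max((ms_between - ms_within) / n, 0.0)  # between-site variance
    sigma_w = ms_within                               # within-site variance
    return sigma_b / (sigma_b + sigma_w)

# Hypothetical pg/mL readings: tight replicates but disagreeing sites
icc = icc_oneway({"A": [100, 102, 98], "B": [150, 151, 149], "C": [120, 119, 121]})
```

Tight within-site replicates combined with disagreement between sites yield an ICC near one, meaning nearly all of the variability is between-site.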
3. Results
3.1. Proficiency of Sites Over Time
The EQAPOL multiplex program conducted 19 rounds of scored proficiency testing (EP4-EP22) from 2013 to 2023 (Supplementary Table 1). The scores each site received per round overall and for both precision (closeness of replicate results within a plate) and accuracy (closeness to consensus pg/mL analyte concentrations determined from the results of all participating sites) are shown in Figures 1A–C. The average scores in each category increased over the early EPs (EP4–6) and then stabilized through to EP22.
Figure 1: Overview of the scores for sites participating in the EQAPOL multiplex external quality assurance program from EP4 to EP22.

The (A) overall, (B) precision, and (C) accuracy scores for each external proficiency (EP) testing cycle plus (D) the percentage of sites at each EP that were offered remediation due to Fair or Poor scores. (A-C) black points show individual sites and red bars indicate mean ± 95% CI. The normalized (E) within- and (F) between-site variance in assay results at sites participating in EQAPOL multiplex external proficiency (EP) testing cycles EP6 to EP22. Results that were over three-fold from the consensus average for a given analyte and sample were removed prior to analysis as likely technical failures. Plots show the 95% CI.
Sites that received an overall score of less than 75 out of 100 were invited to remediation with members of the EOL. An average of 22% of participating sites per round were offered remediation (Figure 1D and Supplementary Table 1). The first two scored EPs (EP4 and EP5) had the highest percentages of underperforming sites. There was no further sustained change in the percentage of underperforming sites in the nine years following EP5.
3.2. Intra- and Inter-assay Variability Over Time
Every proficiency testing sample was supplied to sites in three differently-coded vials, and each vial was assayed in triplicate wells for a total of nine wells per sample per site per EP cycle. Site precision scores were then calculated using the percentage coefficient of variation (%CV) values for the site-reported pg/mL analyte concentrations in every sample. The mean %CVs for each analyte and each sample across all sites were determined, and sites were penalized for each %CV more than one (EP3–7) or two (EP8 onward) standard deviations above the mean %CV for that analyte and sample. Each site’s precision score (Figure 1B) was therefore determined by the number of high within-assay %CVs relative to peer laboratories participating in the same round of proficiency testing. Similarly, accuracy scores (Figure 1C) were calculated from the number of site-determined analyte concentrations that were statistical outliers from the consensus concentration obtained by peer laboratories. While this peer-based benchmarking was valuable for proficiency assessment as it did not require a priori establishment of “gold standards” for assay performance, it meant that accuracy and precision scores were always relative to other laboratories in the same round of EQAPOL proficiency testing and so not comparable between testing rounds.
To evaluate if within-assay precision and/or between-site consistency improved as the program progressed, data from likely technical errors were removed (identified as results that were over three-fold from the consensus average for a given analyte and sample) and then the normalized within-assay and between-site variances were calculated for EP6-EP22. The normalized within-assay variance increased over time (Figure 1E) and tended to be higher when two technicians per laboratory were invited to participate with each technician scored as a separate site (even-numbered EP rounds from EP12 onwards - Table S1). There was, therefore, no improvement in precision as the program progressed. Similarly, normalized between-site variance, reflective of the consistency of the assay results between sites, slightly increased over time and trended higher during two-technician assessments (Figure 1F).
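The technical-failure filter applied before these variance calculations can be sketched as follows. This is a minimal illustration with hypothetical readings; the actual analysis operated per analyte and sample across the full dataset:

```python
def drop_technical_failures(observations, consensus, max_fold=3.0):
    """Drop observations more than `max_fold` from the consensus
    concentration in either direction; the remaining observations
    feed the normalized variance calculations."""
    kept = []
    for x in observations:
        # Fold-change from consensus, symmetric in both directions
        fold = x / consensus if x >= consensus else consensus / x
        if fold <= max_fold:
            kept.append(x)
    return kept

# Hypothetical pg/mL readings against a 100 pg/mL consensus:
# 400.0 (4-fold high) and 30.0 (3.3-fold low) are removed
kept = drop_technical_failures([95.0, 110.0, 102.0, 400.0, 30.0], consensus=100.0)
```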
We next interrogated the dataset from EP6-EP22 to evaluate the contribution of various factors to between- and within-site Luminex bead assay variance and to overall EQA performance.
3.3. Factors Contributing to Assay Variability
3.3.1. Bead Material
Bead array assays can use two different types of beads (also known as microspheres): polystyrene beads, which are washed using a filter plate vacuum manifold, or paramagnetic beads, which are typically washed by magnetic capture. Early EPs (EP1–5) used only polystyrene-based beads. In EP6, a side-by-side comparison of the two bead types was performed. Twenty-six sites assayed blinded aliquots of the same seven test samples using both paramagnetic and polystyrene bead-based kits supplied by the same vendor (Millipore). The intraclass correlation coefficients (ICC) for each sample, analyte and bead type were then calculated using the between- and within-site variances. ICCs were higher for the paramagnetic bead assays, indicating that there was greater within-site consistency in results obtained using this bead type and that the majority of the variance in the data was due to between-site differences (Figure 2A). Consistent with the ICCs, paramagnetic bead assays exhibited lower intra-assay variance (%CV) for 34 of the 35 comparisons performed (seven samples each tested for five analytes) (Figure 2B), and 60% of these differences in precision were statistically significant by Wilcoxon Signed-Rank test. Small but statistically significant differences in the absolute cytokine concentrations obtained by each bead type were detected by paired analysis using the Wilcoxon Signed-Rank test (overall p=0.016) (Figure S1).
Figure 2: Variance in the analyte concentrations reported for seven samples assayed using polystyrene and paramagnetic bead array assays at 26 laboratories.

Twenty-six laboratories assayed blinded aliquots of seven samples using both polystyrene- and paramagnetic- Luminex bead kits from the same vendor. Plots compare the variability of the results obtained with each type of bead. (A) The intraclass correlation of the results for each analyte in each sample across all sites and (B) the average intra-assay coefficient of variation (%CV) obtained for each analyte in each sample. Black bars indicate means.
The precision improvement obtained using paramagnetic beads and magnetic capture-based washing led the EQAPOL multiplex program to switch entirely to this bead type from EP7 onwards. This shift in assay bead type coincided with an increase in average accuracy, precision, and overall scores (Figure 1A–1C). The average site proficiency scores subsequently remained stable over the 16 scored EPs that used only magnetic beads.
3.3.2. Analyte of Interest
All immunoassay reactions within a multiplex bead array assay occur within the same well, but each analyte and detection/capture antibody pair has different biophysical properties. The variance for each analyte across the EP6 to EP22 dataset was therefore evaluated to determine if analyte-specific factors contributed to precision. After removing data from clear technical failures by elimination of results with intra-assay %CV values over 40%, the average %CV for site-determined analyte concentration within an assay plate was 8.6%. Within every EP, IL-6 or IL-2 consistently exhibited the highest within-site %CV and TNF-α or IL-10 the lowest (Figure 3A). These observations are reflected in the 95% CI for the normalized within-site variances for the collective dataset (EP6-EP22; n≥3,126 observations per analyte), wherein IL-6 had the highest intra-assay variance and IL-10 the lowest (Figure 3B).
Figure 3: Analyte- and concentration- specific variance and accuracy of multiplex assays at sites participating in the EQAPOL multiplex external quality assurance program from EP6 to EP22.

(A) The average within-site %CV by analyte at sites participating in EQAPOL multiplex external proficiency (EP) tests EP6 to EP22. Values over 40% were excluded as likely technical failures. Dashed lines indicate mean ± 95% CI across all EPs. (B) Normalized within-site variance across all EP cycles. (C) The average fold-change deviation from the consensus concentration by analyte for each participating site over time. (D) The normalized between-site variance by analyte across all EP cycles. Fold-changes greater than 3-fold from the consensus were excluded as likely technical failures. (E) and (F) the relationships between (E) within- and (F) between-site variance and average consensus analyte concentration for all analytes across all EPs. All bars indicate 95% CI. n=15,721 total observations; n≥3,126 per analyte.
The data were also evaluated for analyte-specific differences in between-site variability. After removing data from clear technical failures by elimination of data falling more than three-fold from the consensus for each analyte per EP, the average result fell within 15% (1.15-fold) of the consensus (Figure 3C). The distance from consensus trended slightly upwards over the course of the program, with IFN-γ consistently falling furthest from the consensus and IL-10 the closest. When the data from EP6-EP22 shown in Figure 3C were aggregated, the 95% confidence intervals for the between-site variances indicated that IL-2, IL-6, and IFN-γ had the greatest between-site variability and IL-10 the lowest (Figure 3D).
Collectively, these data demonstrated that each analyte in a multiplex assay behaves differently, with analyte-specific factors influencing accuracy and precision. They also indicated that the longitudinal increase in within-site and between-site variability observed over the course of the program (Figure 1E, F) cannot be attributed to changes in the performance of any one analyte within the five-plex panel.
3.3.3. Analyte Concentration
Multiplex bead arrays are often preferred over classical plate-based sandwich ELISAs as they enable analyte quantification over a greater dynamic range, typically four to six logs (de Jager and Rijkers, 2006). This reduces the need for reruns when assaying samples with unpredictable analyte concentrations. However, assay performance is not necessarily consistent over the entire range of quantification. To evaluate how performance changed with analyte concentration, the 95% CIs for the normalized variance in site-reported analyte concentration for all spiked serum samples across EP6–22 (n=15,721 observations) were plotted against the average consensus concentration in pg/mL. The within-site variability was highest at low analyte concentrations (<100 pg/mL) and plateaued at concentrations above 350 pg/mL (Figure 3E). Similar trends were observed for between-site normalized variability (Figure 3F). Thus, multiplex bead array quantification of cytokines in the sub-100 pg/mL range was subject to more variability both within a single assay and across testing laboratories.
3.3.4. Bead Washing Automation
Paramagnetic Luminex beads are usually washed by placing the assay plate on a magnetic base to capture the beads and then rinsing the wells with wash buffer. This process can be performed manually using a handheld base, adding wash buffer by pipette, and then removing the residual buffer by flicking, pipetting, or patting on an absorbent pad. Alternatively, automated plate washers with magnetic bases can be used to capture the beads and then repeatedly add and aspirate wash buffer. The automated approach has been proposed to improve consistency, and sites participating in the EQAPOL multiplex EQA program were free to use either washing approach. Sites reported the washing approach used via a survey completed at the time of data submission. The majority of the magnetic-bead-based Luminex assays performed for the EQA program were manually washed using magnetic capture (n=312 of 455 assays (69%) at 32 laboratories). Automated magnetic washers were used for 136 assays (30%) conducted at 13 laboratories, and a minority of assays were washed using a manual vacuum manifold (n=3) or other non-reported method (n=4). The 15,196 observations from automatically and manually magnetically-washed assays were binned by washer type and evaluated for the impact of washer type on accuracy and variance. The average observation fell within approximately 1.15-fold of the consensus concentration regardless of washer type, and the 95% CI for the accuracy of the two washer types overlapped, indicating there was no significant difference in accuracy between the two techniques (Figure 4A). Within-site variance was slightly improved (non-overlapping 95% CI) when using automated washing (Figure 4B), indicating that automated washing resulted in marginally more consistent data between wells. Conversely, between-site variance was significantly elevated for automated washing versus manual magnetic washing, which may reflect that automated washer platforms and operating parameters were not standardized between laboratories.
Figure 4: Effect of instrumentation on the precision and accuracy of paramagnetic bead assays.

Sites participating in EQAPOL multiplex external proficiency (EP) tests EP6 to EP22 reported the bead wash technique and bead reader model used for data collection. Illegible survey responses and observations from likely technical failures (%CV over 40%) were excluded and the remaining observations binned as (A-C) obtained using manual or automated plate washing and (D-F) collected using a Luminex/Bio-Plex 100/200 (n=11,374), FlexMAP3D/BioPlex 3D (n=1,786), or MagPix (n=2,695). Plots show the average fold-change deviation from consensus analyte concentration ±95% CI for each (A) wash technique and (D) bead reader model. 95% CI of normalized within-site variance by (B) wash technique and (E) bead reader model for EP6-EP22 and 95% CI of normalized between-site variance by (C) wash technique and (F) bead reader model for EP6-EP22.
3.3.5. Bead Reader Instrument Model
Luminex Corporation has manufactured several generations of bead readers that use different underlying technologies and have different performance specifications. These include the closely related flow cytometry-based Luminex 100 and 200 instruments that read up to 100 regions (also rebadged as BioPlex 100/200 by BioRad), flow cytometry-based FlexMAP 3D instruments that read up to 500 regions (also rebadged by BioRad as BioPlex 3D), and MagPix instruments that use a CCD camera to read up to 50 magnetic bead regions. Sites participating in the EQAPOL multiplex program were free to use any bead reader model to read their EQA assays. The magnetic bead assay observations in the EQAPOL dataset were binned by reader type (Luminex/Bio-Plex 100/200, n=11,201 observations, 71.7%; FlexMAP3D/BioPlex 3D, n=1,750, 11.2%; or MagPix, n=2,665, 17.1%) and the contribution of bead reader model to the accuracy and variability of the results was then evaluated. Across EP6-EP22, the average result fell within 1.15-fold of the consensus, and none of the instrument types consistently produced data that were closer to the consensus (Figure 4D). The various reader models evaluated were, therefore, all similarly accurate. When all data from EP6-EP22 were combined, there was no difference in normalized within-site variance, indicating the bead reader model did not significantly affect well-to-well precision (Figure 4E). Similarly, the 95% CI for normalized between-site variances across all EPs overlapped (Figure 4F), indicating reproducibility between sites was not affected by bead reader selection. Overall, bead reader model did not significantly influence assay performance.
3.3.6. Assay Kit Manufacturer
Multiple manufacturers produce Luminex bead assay kits for cytokine quantification. In addition, carboxylated beads can be conjugated to capture antibodies within individual laboratories to create in-house assays. We therefore tested if assay kit manufacturer contributed to Luminex bead assay variability. For seven EPs run between EP7 and EP19, participating laboratories could opt to receive a second blinded sample set to assay using the kit of their choice. Participating laboratories elected to use assay kits from a range of brands as well as lab-developed assays (Figure S2). Data obtained from the site-selected kits were evaluated for precision but did not contribute to site proficiency scores or grades.
In each of the site-selected assays, the ICCs of the site-selected kit results were compared to those of the common EQAPOL kit. The ICC is the ratio of between-site variability to total variability for a given analyte and sample, so a higher ICC indicated that a greater proportion of the variability in the data was due to between-site differences. There were 25 ICCs calculated for six of the seven EPs and 15 ICCs for EP7. A Wilcoxon Signed-Rank test was used to compare the kit types, and ICCs were found to be significantly higher for the site-selected assays for all seven EPs (EP7 p=0.0256; EP19 p=0.0048; EP11 p=0.0002; EP9, EP10, EP15 and EP17 p<0.0001) (Figure 5). Therefore, the use of site-selected kits led to a relative increase in between-site variability, with most of the ICCs being close to one, meaning that nearly all variability was between-site and that there was lower consistency in between-site concentration estimates. There was no evidence of a difference in the within-site CVs for the common EQAPOL kit versus site-selected kits.
Figure 5: Effect of assay kit manufacturer on the precision and accuracy of Luminex bead assays.

Sites participating in seven EQAPOL multiplex external proficiency (EP) tests assayed blinded cytokine-spiked test serum using both a lot-matched common kit and protocol and a site-selected kit and protocol. Plots show the intraclass correlation of the observations per analyte per sample from the site-selected versus common assay kits for EPs 7, 9, 10, 11, 15, 17 and 19. Black bars indicate mean.
The inconsistencies in results between assay manufacturers were further evident in evaluation of the three samples with the most measurements in site-selected assays (a minimum of 174 observations per analyte per sample from 58 independent assays). Concentrations obtained using the site-selected kit and common EQAPOL kit were compared using a mixed effects model that accounted for sample, analyte, and kit type. Reported IL-6 concentrations were significantly higher when determined using the site-selected kits. In contrast, reported concentrations of TNF-α, IL-2, IL-10 and IFN-γ were all significantly lower when determined using the site-selected assays (Table 1). Overall, the results obtained for the same analytes from kits by different manufacturers could not be combined without increasing between-site variability and reducing accuracy.
Table 1:
Comparison of concentrations determined by site-selected and common EQAPOL multiplex assay kits using Luminex technology. Results are from mixed effect models for each analyte using LN-transformed site-determined pg/mL concentrations of three test samples in at least 58 independent assays of each kit type.
| Analyte | Site Choice Geometric Mean | Common Kit Geometric Mean | Difference | Ratio | p-value |
|---|---|---|---|---|---|
| IFN-γ | 224.16 | 340.20 | −116.0 | 0.66 | <0.0001 |
| IL-10 | 401.27 | 814.58 | −413.3 | 0.49 | <0.0001 |
| IL-2 | 118.69 | 135.32 | −16.60 | 0.88 | <0.0001 |
| IL-6 | 34.480 | 25.469 | 9.000 | 1.35 | <0.0001 |
| TNF-α | 105.09 | 128.78 | −23.70 | 0.82 | <0.0001 |
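The Ratio column in Table 1 is the exponentiated difference of the mean LN-transformed concentrations, which is equivalent to the ratio of the two geometric means. A minimal check against the IFN-γ row (using the reported geometric means, not the raw data):

```python
import math

def gm_ratio(gm_site, gm_common):
    """Exponentiated difference of mean LN-transformed concentrations,
    i.e. the ratio of geometric means reported in Table 1."""
    return math.exp(math.log(gm_site) - math.log(gm_common))

# IFN-gamma row of Table 1: 224.16 (site choice) vs 340.20 (common kit)
ratio = gm_ratio(224.16, 340.20)
```

The result, approximately 0.66, matches the tabulated ratio for IFN-γ.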
3.3.7. Technical Ability
Once per year from EP12 onwards, laboratories participating in the EQA program could request an additional testing kit to assess the proficiency of a second technician. Data from both technicians’ runs were submitted and analyzed separately, and each technician was given a proficiency score. Figure 6A shows the overall proficiency scores for each pair of technicians from the same laboratory assaying separate aliquots of the same test samples using identical assay kits and following the same standardized EQAPOL multiplex protocol. While at some laboratories both technicians received similar scores, at others there were large discrepancies between individuals despite using the same kit lots, EQA samples, protocols and laboratory facilities.
Figure 6: Effect of technical ability on Luminex bead assay proficiency.

(A) The overall EQAPOL multiplex assay proficiency score for pairs of technicians in the same laboratory participating in the same External Proficiency (EP) round scored using the same lots of assay kit and samples (n=77 pairs from 6 EP rounds). For EQAPOL EP6-EP22: (B) the overall scores received per EQAPOL multiplex versus the perceived skill/experience reported for the technician prior to EQA proficiency scoring (n=439, red bars show geometric mean ± 95% CI), (C) 95% CI of the normalized within-assay variance against perceived skill/experience reported prior to EQA proficiency scoring, and (D) 95% CI of the average fold-difference from the consensus against perceived skill/experience reported prior to EQA proficiency scoring.
Laboratory contacts – who in some cases were also a participating technician – reported the perceived skill/experience level of each technician performing an EQA assay using a 1–5-point scale (low to high) during the data upload process (i.e. after assay completion but prior to scoring). Only seven scored assays were performed by a technician with a perceived skill of 1 out of 5; across the remaining 432 assays for which skill levels were reported, higher perceived skill/experience correlated with higher average overall proficiency scores (Figure 6B). The correlation between perceived skill/experience and the proficiency score received indicated that EQAPOL EQA scores can provide an assessment of technical ability.
The data reported for each EQA assay were then categorized as being obtained by either the most highly skilled/experienced technicians (reported technical skill/experience of 5 out of 5; n=6,748 of 15,470 observations (43.6%)) or less skilled/experienced technicians (reported skill of 1–4 out of 5; n=8,722 observations) and evaluated for the effect of technical ability on accuracy and variance. Assays performed by highly skilled individuals were subject to significantly less within-assay variability (Figure 6C). Highly skilled technicians also generated results that were significantly more consistent between sites (Figure 6D). The analyte concentrations determined by highly skilled individuals and by individuals with less skill/experience were, on average, similarly close to the consensus concentration, so assay accuracy was not significantly improved in assays run by the most skilled technicians (Figure 6E). Overall, superior technical ability contributed to the ability to perform a Luminex bead assay precisely and consistently but was less critical for accuracy.
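The "average fold-difference from the consensus" metric referenced above can be understood as a symmetric, log-scale ratio, so that under- and over-recovery are penalized equally. The exact EQAPOL scoring formulas are described by Rountree et al. (2014); the helper below is an illustrative sketch of this kind of metric, not the program's actual scoring code, and the function names are assumptions.

```python
import math

def fold_difference_from_consensus(observed, consensus):
    """Symmetric fold-difference: exp(|ln(observed / consensus)|).
    An observation of 50 against a consensus of 100, and one of 200
    against 100, both yield a 2-fold difference."""
    return math.exp(abs(math.log(observed / consensus)))

def average_fold_difference(observations, consensus):
    """Geometric-mean-style average of per-observation fold-differences,
    computed on the log scale to keep the metric symmetric."""
    logs = [abs(math.log(x / consensus)) for x in observations]
    return math.exp(sum(logs) / len(logs))
```

Because the average is taken on the log scale, a site reporting 50 and 200 pg/mL against a 100 pg/mL consensus averages to a 2-fold difference rather than being artificially flattered by the cancellation of high and low errors.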
3.3.8. Short- and Long-Term Sample Stability
There is little consensus regarding the length of time specific cytokines are stable under common storage conditions. For some cytokines there is a paucity of data; for others, the conclusions are either conflicting or are based on data from a handful of laboratories using noncomparable approaches (Simpson et al., 2020). Data from the EQAPOL multiplex EQA program were therefore interrogated to determine if increased sample storage duration correlated with changes in reported analyte concentration. Sample stability was evaluated over two timeframes: short-term (weeks, within a single EP) and long-term (years, over multiple EPs).
Proficiency test samples were prepared at the central laboratory, stored in single-use aliquots at −80°C, shipped to sites on dry ice with in-shipment temperature loggers, and then stored at the sites at −80°C until assayed. For each EP, samples were dispatched on the same day but arrived at sites at different times depending on location, and sites had four weeks from receipt of the shipment to complete the assay. The number of days between shipment and assay therefore varied between sites. Three EPs that used independent batches of test samples were evaluated for relationships between the length of time from shipment to assay and the site-determined analyte concentrations per sample. Any sites receiving a “Poor” score were excluded, and the remaining data were interrogated using mixed effects models. The null hypothesis was that there was no relationship between ship-to-assay time and assay results; raw and multiple-measurement-adjusted p-values are listed in Supplemental Table S2 for every sample evaluated. No strong evidence was found for short-term sample degradation.
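The specific multiple-comparison adjustment applied to the per-sample p-values is not detailed in this section; one common choice for screening many related hypothesis tests, shown here purely as an illustrative sketch and not as the EQAPOL analysis code, is the Benjamini–Hochberg false-discovery-rate procedure:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment of a list of raw p-values.
    Returns adjusted p-values in the original input order."""
    n = len(pvals)
    # Walk p-values from largest to smallest so the step-up procedure's
    # monotonicity constraint can be enforced with a running minimum.
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank, i in zip(range(n, 0, -1), order):
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, raw p-values of 0.005 and 0.5 adjust to 0.01 and 0.5: the smallest p-value is inflated by the number of tests, while the largest is left unchanged.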
Each batch of sample aliquots was used across multiple EPs. The three batches of samples used for eight EPs were evaluated for long-term analyte stability by fixed-effect modelling. The EPs were approximately six months apart, so the data covered approximately four years per cytokine per sample, and EP number was used as a surrogate for time. The final model used 11,313 natural log-transformed observations and adjusted for sample, with fixed effects for analyte concentration and the analyte concentration-by-EP interaction. There were no significant interacting effects between EP and analyte concentration for IFN-γ or IL-10, indicating these analytes were stable over the eight EPs. In contrast, there were significant (p<0.0005) negative interacting effects of EP on the concentrations of IL-2 and IL-6, corresponding to a drop in observed pg/mL concentration of approximately 2% per six months. There was also a positive interacting effect for TNF-α (p<0.0001) of roughly a 3% increase in observed concentration per six months (Figure 7A). Overall, there was strong evidence for gradual changes in the observed concentrations of some analytes when samples were stored at −80°C for several years.
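Because the stability model was fit on natural log-transformed concentrations, a per-EP interaction coefficient β implies a multiplicative change of exp(β) in observed concentration per EP (roughly six months). The coefficients below are hypothetical values chosen only to mirror the reported ~2% decline and ~3% rise, not the fitted model estimates:

```python
import math

def percent_change_per_ep(beta):
    """Convert a per-EP coefficient from a natural-log-scale model into
    the implied percent change in observed concentration per EP."""
    return (math.exp(beta) - 1.0) * 100.0

# Hypothetical coefficients mirroring the reported effects:
decline = percent_change_per_ep(-0.02)  # roughly -2% per EP (IL-2/IL-6-like)
rise = percent_change_per_ep(0.03)      # roughly +3% per EP (TNF-a-like)
```

Note that on the log scale, a small coefficient β is approximately equal to the fractional change itself (exp(β) − 1 ≈ β), which is why a coefficient near −0.02 reads directly as a ~2% drop per six-month interval.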
Figure 7: Effects of long-term sample stability and kit manufacturing lot on the analyte concentrations observed by Luminex bead assay.

(A) Select proficiency testing samples were observed 11,313 times over eight External Proficiency (EP) rounds approximately six months apart, and the data were evaluated by fixed-effects modeling for interacting effects between EP (as a substitute for time) and analyte concentration. Data show the EP-analyte interacting effect by analyte ± 95% CI. (B) Proficiency testing samples were observed a total of 15,573 times by Luminex assay using nine different manufacturing lots of the same assay kit. Lots A-H were used for two EPs each, and lot I for one EP. Each differently-colored plot shows the LN-transformed average analyte concentrations for aliquots of a single sample tested using the indicated kit lots.
3.3.9. Assay Manufacturing Lot
The above stability analysis was potentially confounded by the need to use multiple assay kit manufacturing lots, because the manufacturer-determined shelf life was only around 12 months per kit. Each kit lot was, therefore, used for a maximum of two EPs, while each sample was tested across up to eight EPs. To evaluate whether assay kit manufacturing lot significantly affected the observed per-sample concentrations, the LN-transformed average observed analyte concentration for each test sample was plotted against the assay kit lot used to obtain the result (Figure 7B). The plot of 15,573 observations showed that results were highly consistent from lot to lot, and fixed-effect modeling found no significant interaction effect between kit lot and sample (p=0.8163). Manufacturing lot therefore had no discernible effect on the per-sample results.
3.3.10. Remediation
Sites receiving “Poor” or “Fair” grades in the EQA program were invited to remediation teleconferences with technical experts from the EOL. While conclusively determining the root cause of each low score was not possible, some recurring issues were identified during these discussions. Broad categories of subjective causes for low scores are summarized in Figure S3. Approximately 11% of low scores were traced to errors in completing the data upload template, and 21% were likely due to bead reader issues such as clogging, failure to calibrate and validate the reader prior to use, poor maintenance, or other instrument-related factors. Over a quarter of the low scores were attributed to poor liquid handling, deviations from the protocol for reconstituting the standards, or other technique-related issues. Unfamiliarity with the assay platform and/or the EQAPOL protocol was identified as the probable cause of about 20% of low scores. The remaining 21% of remediation discussions identified no discernible cause for poor performance. Sixty-three percent of sites that underwent remediation between EP7 and EP21 achieved acceptable performance (“Good” or “Excellent” grades) in the next EP in which they participated. However, sites nearly always participated in remediation when invited to do so, so there was no suitable control group with which to evaluate whether this improvement was facilitated by remediation.
4. Discussion
The NIH/NIAID Division of AIDS EQAPOL program has provided external cytokine measurement proficiency testing based on Luminex technology to domestic and international laboratories for over 12 years. While proficiency scores improved early in the program, likely due to the switch from polystyrene to paramagnetic beads, the average score stabilized within four years of program launch, and the program did not lead to a long-term improvement in within-assay precision or between-laboratory consistency (Figure 1E,F). However, the majority of participating technicians were highly skilled/experienced (Figure 6B), and the majority of the participating sites enrolled because they were approved to run cytokine bead array assays on clinical samples for NIAID/CIC-supported trials. Several participating laboratories were also certified under the United States federal Clinical Laboratory Improvement Amendments (CLIA) program or similar quality assessments. Consequently, the participant pool was highly competent and familiar with the assay platform. Indeed, the mean observed intra-assay CVs for all analytes were on average under 9% (Figure 3A), considerably below the 15–20% generally considered acceptable for immunoassays (Findlay et al., 2000) and considered good/high precision in other evaluations of the Luminex platform (Erkens et al., 2018; Lasseter et al., 2020). The majority of participating laboratories may, therefore, have been operating at close to the best achievable real-world performance for the platform. In support of this idea, the 95% confidence intervals of between-site variance for the two most recent single-technician EPs (EP19 and EP21) overlap with those for EP6 and EP7 conducted a decade earlier. This is despite changes in most other variables over this timeframe, including assay kit lot, test sample lot, the number and identities of participating sites, and the technicians performing the assay at each site.
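As a reference point for the intra-assay figures quoted above, the percent CV for a sample's replicate wells is conventionally computed as 100 × SD / mean. The sketch below uses made-up duplicate readings purely for illustration:

```python
import statistics

def intra_assay_cv(replicates):
    """Percent coefficient of variation for one sample's replicate wells:
    100 * sample standard deviation / mean."""
    return 100.0 * statistics.stdev(replicates) / statistics.mean(replicates)

# Hypothetical duplicate wells reading 95 and 105 pg/mL give a CV of ~7.1%,
# within the sub-9% average precision reported for the program.
```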
In addition to assessing assay proficiency relative to peers, the data collected during EP6-EP22 enabled evaluation of real-world performance of the Luminex bead array platform across multiple sites, technicians, instruments, and years. To our knowledge, no similarly expansive datasets have been reported for any other multiplexed cytokine assay platform. While many of the factors evaluated were intuitively likely to be important contributors to assay variability, these assumptions had not previously been quantitatively evaluated using such a large dataset, and the relative magnitude of their impact was previously unknown.
The EQAPOL multiplex program’s primary goal is to provide rigorous external proficiency testing for participating laboratories. While the resulting data are valuable for evaluating the assay platform, there are limitations to using them in this way. Firstly, the study was not designed to test variability or accuracy between kit manufacturers, and so cannot be used to compare assay performance between specific brands. Secondly, the study did not compare assay results against single-plex bead assays or ELISAs, so the potential for interacting effects between bead regions or analytes cannot be evaluated (Kingsmore, 2006). Thirdly, internationally-accepted standards or reference reagents were not included in the testing panel or used to make the test samples, and so the evaluation of accuracy was relative rather than absolute and the percentages of spiked cytokines recovered could not be evaluated.
All data collected for this study were obtained using human sera spiked with cytokines at concentrations across the assay’s expected performance ranges. This may not be reflective of performance when assaying biologically-relevant analyte concentrations; however, the cytokines of interest in sera from healthy individuals are often close to or below the typical lower limits of quantitation for the assay platform, making the use of non-spiked sera for EQA challenging. Nascent multiplex assay technologies that offer greater sensitivities than conventional bead arrays, for example single molecule ELISAs (Rissin et al., 2010), proximity extension assays (Lundberg et al., 2011), and oligonucleotide aptamer-based platforms (Gold et al., 2010), may provide new opportunities for quantitative assessment of cytokines in unadulterated sera. Beyond the sensitivity limitations inherent to the platform, this study focused on IL-2, IL-6, IL-10, TNF-α, and IFN-γ in sera as these were the most commonly assayed analyte set and sample matrix in the participating laboratories at the start of the program. Whether these findings can be extrapolated to other sample matrices or analytes remains to be evaluated.
Regardless of these limitations, our analysis leads us to make several recommendations regarding best practices for performing commercial Luminex-bead based assays for human cytokines. Firstly, all assays should be performed using paramagnetic-bead-based kits made by a single manufacturer and run using a standardized protocol. Samples should be stored at −80°C and, when project timelines allow, assayed in a single batch, ideally using kits from the same manufacturing lot. When possible, samples should be assayed by one highly skilled technician at a single site. Where this is not practical, samples should be assayed by highly skilled technicians at sites participating, and exhibiting strong performance, in an external quality assurance program. The bead reader(s) used to conduct the study should be appropriately maintained and calibrated, and performance validated prior to each run. However, the bead reader model used is not critical. Where resources permit, automated washing technology may provide small gains in within-assay consistency but is more a question of convenience than performance, and multi-site studies relying on automated bead washing should standardize washer technologies and parameters to minimize between-site variance. When designing experiments and evaluating data, it should be remembered that performance will vary by analyte and that precision will decrease in the sub-100 pg/mL range. This is the biologically-relevant range for many common cytokines in human serum and plasma, and so studies of these sample types will need more replicates to achieve power similar to studies of cytokine-rich sample types such as cell culture supernatant. We did not find evidence for extensive sample degradation in samples stored at −80°C for up to four years. However, this is based on assessment of recombinant cytokines added to pooled sera and may not reflect the stability of endogenous analytes.
Regardless of this caveat, four years is longer than the manufacturer-determined shelf-life of the assay beads, and so banking samples to assay at the end of a study may be preferable to assaying samples at intermittent timepoints using different assay kit manufacturing lots run by different technicians.
In conclusion, here we report observations from a twelve-year, multi-site quality assurance program for Luminex bead array assays. In the future, we anticipate comparing performance of this assay platform against the multiplexed cytokine assay platforms that have emerged in the decade since this EQAPOL program was first launched.
Highlights
- Multiplex bead array assays are widely used for quantification of cytokines in serum.
- Data from 12 years of multi-site proficiency testing were used to evaluate multiplexed bead assay variability.
- Technical ability, assay vendor, analyte, and analyte concentration were significant contributors to bead assay performance.
- Bead reader instrument model and the use of automated washers were found not to be major contributors to assay performance.
Acknowledgments
Current and past members of the NIAID/CIC Luminex Steering Committee include: Dr. Patricia D’Souza, representing NIAID; Drs. Michael Kalos and Michael Pride, representing CIC; Dr. Lisa Butterfield, University of Pittsburgh, and Dr. Gregory Sempowski, Duke University. The authors are grateful for the guidance of Dr. Jim Lane (NIAID) and thank the following for their assistance with the EQAPOL multiplex program: Ambrosia Garcia, Linda Walker, Sara Brown, Holly Alley and Jennifer Baker (EQAPOL Repository), Ana Sanchez, Cassie Porth and Darin Weed (EQAPOL Central Laboratory), Dr. Marcella Sarzotti-Kelsoe (QADVIP), the EQAPOL Scientific Advisory Board, and the RBL Immunology Unit staff. Finally, the authors thank the anonymous CIC and NIAID sites for their participation and thoughtful feedback.
Funding
This work was funded by the Division of AIDS, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract Nos. HHSN272201000045C and HHSN272201700061C. The funder had no role in the study design, data collection and analysis, or preparation of the manuscript. Work was performed in the Regional Biocontainment Laboratory (RBL) at Duke University, which received partial support for construction and renovation from the National Institutes of Health (UC6-AI058607 and G20-AI167200), and facility support from the National Institutes of Health (UC7-AI180254).
Competing Interests
The authors have no competing interests to declare.
REFERENCES
- Butterfield LH, Potter DM and Kirkwood JM, 2011, Multiplex serum biomarker assessments: technical and biostatistical issues. J Transl Med 9, 173.
- Chowdhury F, Williams A and Johnson P, 2009, Validation and comparison of two multiplex technologies, Luminex and Mesoscale Discovery, for human cytokine profiling. J Immunol Methods 340, 55–64.
- de Jager W, Bourcier K, Rijkers GT, Prakken BJ and Seyfert-Margolis V, 2009, Prerequisites for cytokine measurements in clinical trials with multiplex immunoassays. BMC Immunol 10, 52.
- de Jager W and Rijkers GT, 2006, Solid-phase and bead-based cytokine immunoassay: a comparison. Methods 38, 294–303.
- Djoba Siawaya JF, Roberts T, Babb C, Black G, Golakai HJ, Stanley K, Bapela NB, Hoal E, Parida S, van Helden P and Walzl G, 2008, An evaluation of commercial fluorescent bead-based luminex cytokine assays. PLoS One 3, e2535.
- Dossus L, Becker S, Achaintre D, Kaaks R and Rinaldi S, 2009, Validity of multiplex-based assays for cytokine measurements in serum and plasma from “non-diseased” subjects: comparison with ELISA. J Immunol Methods 350, 125–32.
- Erkens T, Goeminne N, Kegels A, Byloos M and Vinken P, 2018, Analytical performance of a commercial multiplex Luminex-based cytokine panel in the rat. J Pharmacol Toxicol Methods 91, 43–49.
- Faresjo M, 2014, A useful guide for analysis of immune markers by fluorochrome (Luminex) technique. Methods Mol Biol 1172, 87–96.
- Findlay JW, Smith WC, Lee JW, Nordblom GD, Das I, DeSilva BS, Khan MN and Bowsher RR, 2000, Validation of immunoassays for bioanalysis: a pharmaceutical industry perspective. J Pharm Biomed Anal 21, 1249–73.
- Fitzmaurice GM, Laird NM and Ware JH, 2004, Applied longitudinal analysis. Wiley-Interscience, Hoboken, N.J.
- Fu Q, Zhu J and Van Eyk JE, 2010, Comparison of multiplex immunoassay platforms. Clin Chem 56, 314–8.
- Gold L, Ayers D, Bertino J, Bock C, Bock A, Brody EN, Carter J, Dalby AB, Eaton BE, Fitzwater T, Flather D, Forbes A, Foreman T, Fowler C, Gawande B, Goss M, Gunn M, Gupta S, Halladay D, Heil J, Heilig J, Hicke B, Husar G, Janjic N, Jarvis T, Jennings S, Katilius E, Keeney TR, Kim N, Koch TH, Kraemer S, Kroiss L, Le N, Levine D, Lindsey W, Lollo B, Mayfield W, Mehan M, Mehler R, Nelson SK, Nelson M, Nieuwlandt D, Nikrad M, Ochsner U, Ostroff RM, Otis M, Parker T, Pietrasiewicz S, Resnicow DI, Rohloff J, Sanders G, Sattin S, Schneider D, Singer B, Stanton M, Sterkel A, Stewart A, Stratford S, Vaught JD, Vrkljan M, Walker JJ, Watrobka M, Waugh S, Weiss A, Wilcox SK, Wolfson A, Wolk SK, Zhang C and Zichi D, 2010, Aptamer-based multiplexed proteomic technology for biomarker discovery. PLoS One 5, e15004.
- Gunther A, Becker M, Gopfert J, Joos T and Schneiderhan-Marra N, 2020, Comparison of Bead-Based Fluorescence Versus Planar Electrochemiluminescence Multiplex Immunoassays for Measuring Cytokines in Human Plasma. Front Immunol 11, 572634.
- Kingsmore SF, 2006, Multiplexed protein measurement: technologies and applications of protein and antibody arrays. Nat Rev Drug Discov 5, 310–20.
- Lash GE, Scaife PJ, Innes BA, Otun HA, Robson SC, Searle RF and Bulmer JN, 2006, Comparison of three multiplex cytokine analysis systems: Luminex, SearchLight and FAST Quant. J Immunol Methods 309, 205–8.
- Lasseter HC, Provost AC, Chaby LE, Daskalakis NP, Haas M and Jeromin A, 2020, Cross-platform comparison of highly sensitive immunoassay technologies for cytokine markers: Platform performance in post-traumatic stress disorder and Parkinson’s disease. Cytokine X 2, 100027.
- Lundberg M, Thorsen SB, Assarsson E, Villablanca A, Tran B, Gee N, Knowles M, Nielsen BS, Gonzalez Couto E, Martin R, Nilsson O, Fermer C, Schlingemann J, Christensen IJ, Nielsen HJ, Ekstrom B, Andersson C, Gustafsson M, Brunner N, Stenvang J and Fredriksson S, 2011, Multiplexed homogeneous proximity ligation assays for high-throughput protein biomarker research in serological material. Mol Cell Proteomics 10, M110 004978.
- Lynch HE, Sanchez AM, D’Souza MP, Rountree W, Denny TN, Kalos M and Sempowski GD, 2014, Development and implementation of a proficiency testing program for Luminex bead-based cytokine assays. J Immunol Methods 409, 62–71.
- McKay HS, Margolick JB, Martinez-Maza O, Lopez J, Phair J, Rappocciolo G, Denny TN, Magpantay LI, Jacobson LP and Bream JH, 2017, Multiplex assay reliability and long-term intra-individual variation of serologic inflammatory biomarkers. Cytokine 90, 185–192.
- Rissin DM, Kan CW, Campbell TG, Howes SC, Fournier DR, Song L, Piech T, Patel PP, Chang L, Rivnak AJ, Ferrell EP, Randall JD, Provuncher GK, Walt DR and Duffy DC, 2010, Single-molecule enzyme-linked immunosorbent assay detects serum proteins at subfemtomolar concentrations. Nat Biotechnol 28, 595–9.
- Rountree W, Vandergrift N, Bainbridge J, Sanchez AM and Denny TN, 2014, Statistical methods for the assessment of EQAPOL proficiency testing: ELISpot, Luminex, and Flow Cytometry. J Immunol Methods 409, 72–81.
- Simpson S, Kaislasuo J, Guller S and Pal L, 2020, Thermal stability of cytokines: A review. Cytokine 125, 154829.