2023 Oct 17;193(3):426–453. doi: 10.1093/aje/kwad201

Table 3.

Studies That Used Negative Controls to Assess the Performance of Methods for Drug Safety and Effectiveness Studies

First Author, Year (Reference No.) Description of the Use of NCs in the Study
Studies That Used NCs as Reference Standards in Evaluating the Performance of Pharmacovigilance Methods
Brown, 2007 (165) The study assessed the performance of sequential analysis for active surveillance of ADR in large observational databases. The performance was measured by applying sequential analysis to 5 PC and 2 NC drug-outcome pairs in administrative data from 9 health plans participating in the HMO Research Network’s Center for Education and Research on Therapeutics. The signal was determined by testing the maximized sequential probability ratio against the null hypothesis, and the number of significant signals among PC and NC drug-outcome pairs was reported.
Brown, 2009 (166) The study assessed the variation in the performance of sequential analysis with alternative analytical choices with regard to exclusion criteria and surveillance period, and the results were compared with the performance of sequential analysis with the original set of analytical choices, which was assessed in an earlier study (Brown et al. (165)). The same data source and methods from the previous study were used for the assessment including the choice of PC and NC drug-outcome pairs.
Ryan, 2012 (177) The study evaluated the performance (measured with sensitivity, specificity, PPV, and AUC) of 8 different methods in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network, using 53 unique drug-outcome pairs (9 PC and 44 NC drug-outcome pairs). The methods evaluated were: 1) disproportionality analysis, 2) ICTPD, 3) SCCS, 4) case-control surveillance, 5) case-crossover, 6) observational screening, 7) high-dimensional propensity score, and 8) incident user design.
Schuemie, 2012 (182) The study evaluated the performance (measured with AUC) of 10 different methods to detect ADR in health-care databases participating in EU-ADR, using a set of PCs and NCs as reference standards. The methods evaluated in the study included: 1) spontaneous reporting system methods (PRR, ROR, GPS, BCPNN), 2) cohort methods (IRR, LGPS, Bayesian hierarchical model), 3) case-based methods (matched case-control, SCCS), and 4) LEOPARD.
DuMouchel, 2013 (162) The study evaluated the performance (measured with AUC) of disproportionality analysis with different combinations of analytical choices (with regard to outcome definition, disproportionality metric, stratification, and time at risk) across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding) in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network and 6 simulated data sets. A collection of 165 PC and 234 NC drug-outcome pairs was used as reference standards for evaluation.
Madigan, 2013 (168) The study evaluated the performance of case-control design for ADR risk identification and compared AUC for the design with various combinations of analytical choices with regard to: 1) number of controls, 2) minimum time before outcome, 3) time at risk, and 4) cohort nesting within indication. The reference standards included 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, upper GI bleeding). The performance was evaluated in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network.
Norén, 2013 (171) The study evaluated the performance of the calibrated self-controlled cohort analysis with temporal pattern discovery as a tool for risk identification, by comparing AUC for different combinations of the design choices with regard to: 1) surveillance period and 2) control period. The method was applied to 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding). The performance was evaluated in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network.
Reich, 2013 (175) The study evaluated the performance of different algorithms to define health outcomes of interest for 3 health outcomes (acute kidney injury, acute liver injury, and acute MI) for risk identification methods in observational data. The algorithms were tested across an array of analytical methods and design choices (including choices for preexposure control period and time at risk) in 4 US health-care databases participating in the OMOP network. The methods were executed against predefined drug-outcome pairs (test cases) that included PCs and NCs. The performance was measured with AUC, bias, and minimal detectable relative risk.
Reich, 2013 (176) The study assessed the effects of drug and outcome prevalence on the feasibility and performance of risk identification methods in observational data. The methods of interest (new user cohort design, case-control design, SCCS, SCC, ICTPD, disproportionality analysis, and LGPS) were tested against PCs and NCs across 4 health outcomes (acute kidney injury, acute liver injury, acute MI, and upper GI bleeding) in 3 US health-care databases participating in the OMOP network. The performance was measured with AUC.
Ryan, 2013 (178) The study evaluated the performance (measured with AUC, bias, and coverage probabilities) of the new-user cohort method for detecting safety signals in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network and 6 simulated data sets. Different combinations of analytical choices were tested, including choices for: 1) required observation time prior to the exposure, 2) nesting within the population with the indication of the target drug, 3) comparator population, 4) time-at-risk, 5) propensity score covariate selection strategy, 6) covariate eligibility window, 7) dimensions to include as potential covariates, 8) additional covariates in the propensity score model, 9) propensity score trimming, and 10) analysis method. The method was tested against 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding).
Ryan, 2013 (179) The study evaluated the performance of the SCC method for detecting safety signals in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network and 6 simulated data sets. Performance was measured with AUC, bias (measured with relative risk estimates from the NC analysis), and coverage probabilities (proportion of the CIs that contained the true relative risk) for different combinations of design choices, including choices for: 1) exposure definition, 2) outcome definition, 3) time-at-risk, and 4) control period. A set of 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding) was used as reference standards.
Ryan, 2013 (180) The study compared the performance (measured with AUC, bias, MSE, and CI coverage probability) of 7 established pharmacovigilance methods including: 1) new user cohort, 2) case-control, 3) SCCS, 4) SCC, 5) disproportionality analysis, 6) temporal pattern discovery, and 7) LGPS. The methods were tested against 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver injury, acute MI, acute renal failure, and GI bleeding) in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network.
Ryan, 2013 (181) The study used simulated health-care data to evaluate the performance (measured with AUC, bias, and MSE) of 7 observational designs for ADR identification. The tested methods included: 1) case-control, 2) new user cohort method, 3) disproportionality methods, 4) ICTPD, 5) LGPS, 6) SCC, and 7) SCCS. The methods were tested against 399 reference drug-outcome pairs (including 165 PCs and 234 NCs) across 4 health outcomes (acute liver injury, acute MI, acute renal failure, and GI bleeding) in 6 simulated data sets.
Schuemie, 2013 (183) The study replicated the OMOP experiment in European databases and compared the performance of the established methods for risk identification, including: 1) case-control, 2) new user cohort method, 3) disproportionality methods, 4) ICTPD, 5) LGPS, 6) SCC, and 7) SCCS. The methods were tested against 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver injury, acute MI, acute renal failure, and GI bleeding) in 6 health-care databases (administrative claims and EHR) participating in the EU-ADR database network. The performance was measured with AUC and bias.
Schuemie, 2013 (184) The study evaluated the performance (measured with AUC, bias, and coverage probabilities) of LGPS and LEOPARD for safety signal detection in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network and 6 simulated data sets. Different combinations of analytical choices were tested, including choices for: 1) data source, 2) exposure definition, 3) run-in period, 4) carry-over period, and 5) shrinkage application. The methods were tested against 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding).
Suchard, 2013 (185) The study evaluated the performance (measured with AUC, bias, and coverage probabilities) of SCCS for safety signal detection in 5 US health-care databases (administrative claims and EHR) participating in the OMOP network and 6 simulated data sets. Different combinations of design choices were tested, including choices for: 1) outcome definition, 2) multivariate adjustment, 3) at-risk window, 4) observation time-window definition, and 5) minimum observation length. The method was tested against 165 PC and 234 NC drug-outcome pairs across 4 health outcomes (acute liver failure, acute MI, acute renal failure, and upper GI bleeding).
Gagne, 2014 (167) The study developed a semiautomated process for active safety monitoring of new drugs in a distributed data environment for health-care data. Five PC and 2 NC drug-outcome pairs were defined and used to test the validity of the process, and the study reported whether the system generated an alert for each drug-outcome pair.
Pratt, 2015 (174) The study evaluated the robustness and consistency of PSSA to detect ADRs across settings in 5 different countries. The method was applied to 1 PC and 2 NCs as test cases. The ASR and 95% CIs were estimated for the associations with ADRs for each country.
Backenroth, 2016 (163) The study compared AUCs for the case-control method with different analytical choices (including choices for: 1) covariates used (none, demographics, phenome-wide association study (PHEWAS) groups plus demographics), 2) covariate selection method (none, or 1-step LASSO), and 3) estimation method (marginal odds ratio, 1 model per drug-outcome pair, or 1 model per outcome)) across 4 health outcomes (acute kidney injury, acute liver injury, acute MI, and GI ulcer hospitalization) in a US EHR database.
Marbac, 2016 (169) The study proposed a model-based method for analyzing spontaneous reporting databases (Bayesian model selection in logistic regression), and compared its performance (measured with number of signals, rate of positive controls, rate of negative controls, and rate of unknown signals) with that of well-established approaches, including 4 disproportionality-based methods (PRR, ROR, reporting Fisher exact test, FDR-based GPS, LASSO-based logistic regressions, and model-based logistic regressions) across 4 health outcomes (acute MI, acute kidney injury, acute liver injury, and upper GI bleeding) in the French pharmacovigilance database. The OMOP reference set of test cases, which included 165 PC and 234 NC drug-outcome pairs, was used to evaluate the performance.
Osokogu, 2016 (172) The study assessed the performance (AUC) of 2 established signal detection algorithms (PRR and empirical Bayes geometric mean) in the pediatric population in data from the US FAERS. Thirty-seven PC and 90 NC drug-outcome pairs from a pediatric-specific Global Research in Pediatric reference set were used.
Hauben, 2017 (164) The study estimated the effects of database restriction for oncology drugs on signal detection performance using FAERS data. Multi-item GPS analysis, a type of disproportionality analysis, was carried out in restricted and unrestricted data sets for a predefined oncology-specific reference set (including 638 PC and 67 NC drug-outcome pairs). The method performance was measured with sensitivity, specificity, PPV, NPV, signal/noise, F, and Matthews correlation coefficient.
Nishtala, 2017 (170) The study assessed the performance of PSSA in the New Zealand prescription database for risk identification using 6 PC drug exposures and 6 NC drug exposures. The adjusted sequence ratios and 95% CIs were estimated for the associations with ADRs.
Pierce, 2017 (173) The study assessed the potential utility of social media data for monitoring drug safety signals and examined whether a safety signal appears in social media prior to being reported to a spontaneous reporting system. An analysis of social media posts was conducted for 10 safety signals (PCs) recently identified by FDA, and 6 NC drug-outcome pairs. The number of associations that appeared in the social media data was reported for each of the pairs as the outcome.
Trinh, 2018 (161) The study evaluated the performance of combining the change-point analysis method with disproportionality analysis for detecting changes in safety signals in the French National Pharmacovigilance Database (Base Nationale de Pharmacovigilance, or BNPV) and the EudraVigilance database. The performance was assessed by testing the change-point analysis hypothesis against the null hypothesis for 39 PC and 56 NC drug-outcome pairs selected from the OMOP reference set in the EudraVigilance database; 2 PC drug-outcome pairs were used as test cases in the BNPV database.
Thurin, 2020 (186) The study assessed the performance of 3 case-based designs (SCCS, case-control, and case-population) for identifying risk of upper GI bleeding in the French National Healthcare System Database. Different combinations of design and analytical choices for each method were tested against the reference standards (22 PC and 42 NC drugs for upper GI bleeding), and the performance (measured with AUC, MSE, and coverage probability) was compared. The best-performing method was selected for calibration using the empirical P-value calibration method to improve the accuracy of the method.
Studies That Used NCs to Compare Different Methods for Bias Mitigation in Pharmacoepidemiologic Studies
McGrath, 2015 (86) The study assessed the residual confounding associated with different analytical choices (marginal structural models with different specifications and sample restriction in multiple different ways) in evaluating influenza vaccine effectiveness (comparing the vaccinated vs. unvaccinated) in adult hemodialysis patients. The study used data from the US Renal Data System linked with data from a commercial dialysis provider to compare all-cause mortality between the groups during 3 influenza seasons. The outcome measured in the pre-influenza season was used as an NC outcome to measure residual confounding. The study found that marginal structural models did not improve confounding control even after adding additional variables. Restricting the sample to more homogeneous and healthier populations (e.g., by applying stricter survival requirements) moved the results toward the null in pre-influenza seasons, indicating reduction of residual bias.
Davies, 2017 (31) The study applied 3 techniques (NC outcome, NC populations, and tests of covariate balance) to assess the relative bias of the effect estimates from 2 analytical methods: 1) an instrumental variable (physicians’ prescribing preferences) approach, and 2) a conventional regression. The Clinical Practice Research Datalink was used to investigate the effect of smoking cessation therapies (varenicline vs. nicotine replacement products) on suicide and self-harm and on depression; urinary tract infection was used as the NC outcome. A bias component plot was applied to illustrate the relative bias of the 2 methods.
Weinstein, 2017 (141) The study compared the residual bias associated with 6 different analytical methods (multivariate outcome modeling with 3 different specifications, publication variable PS adjustment, large-scale PS adjustment, and large-scale PS adjustment with large-scale outcome modeling) to adjust for confounding in evaluating the safety of frequently used over-the-counter medications (paracetamol vs. ibuprofen) using data from the Clinical Practice Research Datalink. The study used 31 NC outcomes to assess the bias and reported the proportion of NC outcomes that were statistically significant as an indicator of residual bias. While multivariate outcome modeling alone and the publication variable PS adjustment resulted in residual bias (fraction significant > 5%), the other 2 large-scale PS adjustment methods resulted in no significant NC results.
Kaiser, 2018 (64) The study assessed potential biases associated with different analytical choices in estimating the association (HR) between statin use and incident MI using a retrospective cohort design. Six analytical variations were compared for potential bias: 1) crude analysis, 2) restriction to those eligible for statins, 3) multivariable adjusted among eligible population, 4) restriction to eligible new users, 5) multivariable adjusted among eligible new users, and 6) PS-matched among eligible new users. Noncardiovascular mortality was used as an NC outcome. HRs from the NC analysis were compared with HRs from the primary analysis to assess bias. Data from participants in the Cardiovascular Health Study from 1989 to 2004 were used.
Izurieta, 2019 (60) The study assessed the impact of using multiple imputation in reducing potential bias due to residual confounding in an HZV effectiveness study (comparing the vaccinated vs. unvaccinated) using 13 NC outcomes. The authors had previously published an HZV effectiveness study using Medicare claims data and checked the residual bias using 13 NC outcomes in the original study (59). This study additionally linked the MCBS to the Medicare claims to impute 3 new MCBS variables, potential confounders missing in the original analysis. The same 13 NC outcomes from the previous study were used to detect residual confounding in the imputation analysis and compare the results with the original analysis. The study found similar HR estimates compared with the original study across the 13 NC outcomes (CIs overlapping). The point estimates shifted toward the null for 8 out of 13 outcomes, and the CIs were wider for the imputation analysis.

Abbreviations: ADR, adverse drug reaction; ASR, adjusted sequence ratio; AUC, area under the receiver operating characteristic curve; BCPNN, Bayesian confidence propagation neural network; CI, confidence interval; EHR, electronic health records; EU-ADR, Exploring and Understanding Adverse Drug Reactions; FAERS, Food and Drug Administration’s Adverse Event Reporting System; FDA, US Food and Drug Administration; FDR, false discovery rate; GI, gastrointestinal; GPS, γ Poisson shrinker; HMO, health maintenance organization; HR, hazard ratio; HZV, herpes zoster vaccine; ICTPD, information component temporal pattern discovery; IRR, incidence rate ratio; LASSO, least absolute shrinkage and selection operator; LEOPARD, longitudinal evaluation of observational profiles of adverse events related to drugs; LGPS, longitudinal γ Poisson shrinker; MCBS, Medicare Current Beneficiary Survey; MI, myocardial infarction; MSE, mean squared error; NC, negative control; NPV, negative predictive value; OMOP, Observational Medical Outcomes Partnership; PC, positive control; PPV, positive predictive value; PRR, proportional reporting ratio; PS, propensity score; PSSA, prescription sequence symmetry analysis; ROR, reporting odds ratio; SCC, self-controlled cohort; SCCS, self-controlled case series; US, United States.
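
Several rows of the table describe scoring a method against a reference set of PC and NC drug-outcome pairs (AUC, sensitivity, specificity, PPV) or reporting the fraction of NC outcomes with statistically significant estimates as an indicator of residual bias. The following is a minimal Python sketch of how such metrics could be computed. It is not taken from any of the cited studies: the function names, the example scores, the threshold, and the CIs are all hypothetical and serve only as illustration.

```python
from typing import List, Optional, Tuple


def auc_from_reference_set(scores: List[float], labels: List[int]) -> float:
    """AUC as the probability that a randomly chosen PC (label 1) receives a
    higher score than a randomly chosen NC (label 0); ties count as 0.5."""
    pcs = [s for s, y in zip(scores, labels) if y == 1]
    ncs = [s for s, y in zip(scores, labels) if y == 0]
    if not pcs or not ncs:
        raise ValueError("need at least one PC and one NC")
    wins = 0.0
    for p in pcs:
        for n in ncs:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pcs) * len(ncs))


def threshold_metrics(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float, Optional[float]]:
    """Sensitivity, specificity, and PPV when a pair is flagged as a signal
    if its score is >= threshold. PPV is None when nothing is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one PC and one NC")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else None
    return sensitivity, specificity, ppv


def fraction_significant_ncs(cis: List[Tuple[float, float]]) -> float:
    """Share of NC effect estimates whose CI excludes the null value of 1
    (ratio scale). Under no residual bias, roughly 5% would be expected
    for 95% CIs."""
    if not cis:
        raise ValueError("need at least one NC estimate")
    significant = sum(1 for lo, hi in cis if lo > 1.0 or hi < 1.0)
    return significant / len(cis)


# Hypothetical example: 3 PCs and 3 NCs with relative risk estimates as scores.
example_scores = [2.5, 1.8, 0.9, 1.1, 1.0, 0.7]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = auc_from_reference_set(example_scores, example_labels)
example_metrics = threshold_metrics(example_scores, example_labels, threshold=1.5)

# Hypothetical 95% CIs for 4 NC outcomes.
example_nc_cis = [(0.8, 1.2), (1.1, 1.9), (0.6, 0.95), (0.9, 1.4)]
example_fraction = fraction_significant_ncs(example_nc_cis)
```

The pairwise AUC definition is used here because it needs no external library and makes explicit how PCs and NCs act as the reference standard; a real evaluation would more likely rely on an established statistics package.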