AMIA Summits on Translational Science Proceedings. 2021 May 17;2021:180–189.

Exploring the Hazards of Scaling Up Clinical Data Analyses: A Drug Side Effect Discovery Case Report

Franck Diaz-Garelli 1, Todd R Johnson 2, Mohammad H Rahbar 2, Elmer V Bernstam 2
PMCID: PMC8378643  PMID: 34457132

Abstract

We assessed the scalability of a pharmacological signal detection use case from a single-site clinical data warehouse (CDW) to a large aggregated multisite CDW (single-site database with 754,214 distinct patient IDs vs. multisite database with 49.8M). We aimed to explore whether a larger clinical dataset would provide clearer signals for secondary analyses such as detecting the known relationship between prednisone and weight gain. We found a significant rate of weight gain using the single-site data but not using the aggregated data (0.0104 kg/day, p<0.0001, vs. -0.050 kg/day, p<0.0001). This rate was also found more consistently across 30 age and gender subgroups using the single-site data than in the aggregated data (26 vs. 18 significant weight gain findings). Contrary to our expectations, analyses of much larger aggregated clinical datasets did not yield stronger signals. Researchers must check the underlying model assumptions and account for greater heterogeneity when analyzing aggregated multisite data to ensure reliable findings.

Introduction

Implementation of electronic health records (EHRs) has enabled secondary analysis of clinical data for a variety of purposes1,2. Traditional, time-consuming, and costly research methods such as randomized controlled trials (RCTs)2-5 are being complemented with secondary analyses of clinical data generated through routine clinical practice, a core activity of burgeoning learning healthcare systems6,7. Drug safety surveillance is one important use case: 32% of novel therapeutics approved by the FDA were associated with a post-marketing safety event8. Thus, post-marketing discovery and surveillance of drug side effects (pharmacoepidemiology) using retrospective analyses of large clinical data sets has gained interest.

Though the current abundance of machine-readable clinical data sets enables such analyses, clinical data are prone to quality issues9-13. These issues are sometimes considered part of the noise that may mask signals in EHR data. It is often argued that larger data sets amplify the signals, allow a more reliable estimate of the noise, and improve the statistical power for detecting associations14-18. In the healthcare setting, this can be achieved in two ways. A healthcare system can wait until its database has enough patients to study a specific phenomenon, in which case a more traditional research approach, such as a single-site experimental study (e.g., an RCT), is developed and implemented. An alternative strategy is for multiple healthcare systems to aggregate their data in clinical data warehouses (CDWs)19,20 to increase statistical power21, making clinical signals more detectable by secondary analysis methods such as regression analysis, as long as the underlying assumptions of these models are met. Research efforts are currently geared toward building large CDWs22-27, yet we found few studies in the literature reporting the impact of EHR data aggregation28,29 or providing solutions to its consequences30, and we found none illustrating its impact on traditional analytical methods such as regression analysis.

In this article, we attempt to "rediscover" the association between prednisone, a commonly prescribed corticosteroid, and weight gain31,32 with longitudinal linear regression methods using two databases: a single-site CDW33 and a much larger aggregated CDW. We chose this association because it is well-accepted by clinicians34 and is defined by relatively objective numerical data (i.e., drug administration and weight change over time). Further, weight measurement does not require complex equipment and is often recorded during routine clinical care for a wide variety of patients across care settings. Thus, investigating the relationship between prednisone and weight gain is, in a sense, a "best case" scenario for assessment of factors associated with a drug side effect. Our goal was to "rediscover" a known side effect without leveraging knowledge about the side effect, thereby approximating the process of monitoring for previously unknown side effects. The contribution of this work is to illustrate an existing challenge of scaling up analyses using large, aggregated CDWs.

Methods

Data Sources - We used longitudinal linear regression methods to analyze the known relationship between prednisone and weight gain using real EHR data extracted from a single-site CDW and an aggregated CDW database. The single-site data set was extracted from an outpatient clinic's EHR production database and contained 754,214 distinct patient IDs with data from April 2004 to January 2014. The aggregated data set, Cerner Health Facts (Cerner Corporation, Kansas City, MO), a HIPAA de-identified database, contained over 49.8 million distinct patient IDs with data from January 2000 to October 2014. We selected patients with at least one prednisone prescription, along with all their recorded weights and covariates such as age and gender. Descriptive statistics are shown in Table 1. No missing values were found for the age and gender variables in either data set. The weight variable was normally distributed. Because the distribution of drug exposure was not normal, we categorized exposure into a binary variable encoded as high (above mean) vs. low (below mean) exposure. We filtered out patients under 21 years of age and extreme outliers for weight (i.e., weight > 400 kg). A second round of filtering was performed on the weight variable by excluding measurements more than three standard deviations from the mean in either direction. This study was approved by the Committee for the Protection of Human Subjects (the UTHSC-H IRB) under protocol HSC-SBMI-13-0549.35
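
The cohort selection and cleaning steps above can be summarized in a short data-preparation sketch. This is an illustrative example only, not the authors' extraction code; the table layout and column names (prescriptions, weights, drug_name, weight_kg, patient_id) are hypothetical assumptions.

```python
import pandas as pd

def build_cohort(prescriptions: pd.DataFrame, weights: pd.DataFrame,
                 age_by_patient: pd.Series) -> pd.DataFrame:
    """Select weights for patients with >=1 prednisone prescription and
    apply the exclusions described in the Methods."""
    # Patients with at least one prednisone prescription
    pred = prescriptions[
        prescriptions["drug_name"].str.contains("prednisone", case=False, na=False)
    ]
    cohort_ids = pred["patient_id"].unique()
    w = weights[weights["patient_id"].isin(cohort_ids)].copy()

    # Exclude patients under 21 years of age and implausible weights (> 400 kg)
    w = w[w["patient_id"].map(age_by_patient) >= 21]
    w = w[w["weight_kg"] <= 400]

    # Second pass: drop weight values more than 3 SD from the mean
    mu, sd = w["weight_kg"].mean(), w["weight_kg"].std()
    w = w[w["weight_kg"].between(mu - 3 * sd, mu + 3 * sd)]
    return w
```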

Table 1. Descriptive statistics for the single-site and aggregated datasets. Note that the original aggregated database contains over sixty times more patients than the single-site database, whereas the extracted data set has roughly 17 times more patients.

Measure | Single-Site | Aggregated
Number of Distinct Patient IDs in Database | 754,214 | 49,826,219
Timespan Covered | Apr. 2004 - Jan. 2014 | Jan. 2000 - Oct. 2014
Number of Patients Included (% of Database) | 9,767 (1.29%) | 169,944 (0.34%)
Number of Weight Measurements | 93,617 | 2,278,953
Missing Drug Exposure | 15.40% | 20.40%
Average Age (in years) for 1st Prescription | 54±15 | 63±17
Median Age (in years) for 1st Prescription | 55 | 64
Mean±SD Weight | 84.2±22 kg | 82.5±24 kg
Mean±SD Prednisone Exposure | 312±697 mg | 781±1,740 mg

Statistical Analysis - We used summary statistics such as the mean, median, and extreme values (e.g., values more than three standard deviations from the mean) to screen the data for outliers, missing values, and erroneous data. We assessed normality of continuous variables based on histograms. To detect weight gain over time, we built a longitudinal linear regression model using the generalized estimating equations (GEE) method. Our longitudinal regression model predicted the main outcome (weight) based on drug exposure (high/low), the number of days from the first prescription (time), the patient's age at the time of first prescription, and the patient's gender. An autoregressive correlation structure was used to account for potential correlation between multiple measurements from the same individual. Statistical significance was set at α=0.05. The model used weight as the dependent variable, with time and a binary exposure group (cumulative prednisone dose below or above the mean) as independent variables, along with the known covariates gender and age. The observational time span (or time window) was varied around the time of prescription, between 90 days before and 360 days after prescription, to improve model fit. We used the Quasi-likelihood under Independence Model Criterion (QIC)36 as a quantitative measure of goodness of fit. QIC is a generalization of Akaike's information criterion typically employed with GEE methods; it represents the goodness of fit of a statistical model and informs model selection. We used SAS (version 9.4, SAS Institute Inc., Cary, NC) for statistical analysis.
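
For readers who want to reproduce this type of model outside of SAS, the sketch below shows an analogous GEE fit in Python with statsmodels. It is a minimal illustration, not the analysis code used in this study; the column names (weight_kg, days, exposure_hi, age_at_rx, gender, patient_id) are assumptions, and the QIC accessor reflects recent statsmodels versions.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_gee(df):
    """df: one row per weight measurement, with columns weight_kg, days
    (days from first prescription), exposure_hi (0/1 for above/below mean
    cumulative dose), age_at_rx, gender, and patient_id."""
    model = smf.gee(
        "weight_kg ~ days * exposure_hi + age_at_rx + C(gender)",
        groups="patient_id",                        # repeated measures per patient
        data=df,
        cov_struct=sm.cov_struct.Autoregressive(),  # within-patient correlation
        family=sm.families.Gaussian(),
    )
    result = model.fit()
    # QIC for model selection (lower is better); recent statsmodels versions
    # expose it on the results object and return (QIC, QICu).
    qic, _qicu = result.qic()
    return result, qic
```

In such a model, the coefficient on days (and its interaction with exposure) corresponds to the estimated weight change per day reported in the Results.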

Sensitivity Analysis - To assess the robustness of our results, we built regression models for multiple sub-populations contained in our data set using the same methods described above. For example, we built separate regression models for males only, females only, patients with an initial weight below or above average, and patients with ages below or above average at the time of prescription. These subgroup analyses aimed to identify which subgroups of the cohort contributed to the overall effect. If the effect varies across subgroups, it indicates potential effect modification that should be explored, as sketched below.
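
A hypothetical outline of this subgroup loop, building on the fit_gee sketch above, is shown here; the subgroup definitions and column names (e.g., baseline_weight) are illustrative assumptions.

```python
def subgroup_estimates(df):
    """Refit the same GEE model (fit_gee from the sketch above) on
    sub-populations and collect the estimated weight change per day
    (the `days` coefficient) and its p-value."""
    subgroups = {
        "male": df["gender"] == "M",
        "female": df["gender"] == "F",
        "initial_weight_below_mean": df["baseline_weight"] < df["baseline_weight"].mean(),
        "initial_weight_above_mean": df["baseline_weight"] >= df["baseline_weight"].mean(),
        "age_below_mean": df["age_at_rx"] < df["age_at_rx"].mean(),
        "age_above_mean": df["age_at_rx"] >= df["age_at_rx"].mean(),
        # Age-decade categories (20s, 30s, ...) can be added the same way.
    }
    estimates = {}
    for name, mask in subgroups.items():
        result, _qic = fit_gee(df[mask])
        estimates[name] = (result.params["days"], result.pvalues["days"])
    return estimates
```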

Results

We built longitudinal linear regression models to optimize model fit (i.e., reduce QIC) by exploring time windows between 10 days before and 120 days after prescription. The best fit was found for 7-90 days and 7-32 days for the single-site and aggregated data, respectively (Table 2). This matches what is clinically expected: patients gain weight within the first three months of starting prednisone. These conditions returned the largest effect sizes for the relationship between prednisone and weight in both data sets: 0.0104 kg/day (p<0.0001) for the single-site data and -0.050 kg/day (p<0.0001) for the aggregated data, which showed no association with exposure to prednisone (p=0.847).
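
The window search can be expressed as a simple grid search over candidate lower and upper limits, keeping the window with the lowest QIC. The sketch below reuses the fit_gee function from the Methods sketch; the candidate grids are illustrative, not the exact grid used in the study.

```python
def best_time_window(df, lower_limits=(0, 7, 14), upper_limits=(32, 60, 90, 120)):
    """Return (qic, lower, upper) for the time window with the lowest QIC."""
    best = None
    for lo in lower_limits:
        for hi in upper_limits:
            # Keep measurements within [lo, hi] days of the first prescription
            window = df[(df["days"] >= lo) & (df["days"] <= hi)]
            _result, qic = fit_gee(window)
            if best is None or qic < best[0]:
                best = (qic, lo, hi)
    return best
```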

Table 2. Results from Longitudinal Linear Regression models with weight as the dependent variable under varying parameter assumptions based on the single-site dataset (n1=93,617) and the aggregated dataset (n2=2,278,953).

Values are shown as estimate (p-value).

Parameter | Overall Regression (After Prescription), Single-site | Overall Regression (After Prescription), Aggregated | Upper Time Limit, Single-site | Upper Time Limit, Aggregated | Upper and Lower Time Limit, Single-site | Upper and Lower Time Limit, Aggregated
Intercept | 83.45 (<.0001) | 93.8 (<.0001) | 82.05 (<.0001) | 92.6 (<.0001) | 84.2715 (<.0001) | 92.2 (<.0001)
Time (Days from Prescription) | 0 (0.9004) | -0.0032 (<.0001) | -0.0048 (0.0038) | -0.061 (<.0001) | 0.0104 (<.0001) | -0.050 (<.0001)
Age (in years) | -0.087 (<.0001) | -0.2399 (<.0001) | -0.0729 (<.0001) | -0.228 (<.0001) | -0.101 (<.0001) | -0.213 (<.0001)
Gender: M (Ref. F) | 14.43 (<.0001) | 8.3245 (<.0001) | 14.78 (<.0001) | 8.60 (<.0001) | 14.10 (<.0001) | 7.81 (<.0001)
Exposure (Classified): Hi (Ref. Lo) | -0.56 (0.389) | 0.149 (.0002) | -1.40 (0.0266) | 0.146 (<.0001) | -1.64 (0.0616) | -2.51 (0.847)
Days*Exposure (Classified): Hi (Ref. Lo) | -0.0003 (0.237) | -0.00008 (<.0001) | 0.0158 (<.0001) | -0.0194 (<.0001) | 0.0156 (0.0055) | 1.47 (0.068)
QIC Value | 79,020 | 835,487 | 15,669 | 393,087 | 7,335 | 120,133

Overall, our longitudinal linear regression (i.e., using all data after prescription) yielded no statistical significance for time (p=0.900), exposure (p=0.389), or the interaction between these two variables (p=0.237) in the single-site data. The aggregated dataset showed significance for all variables, most probably due to the large size of the dataset. However, that model estimated a negative weight change over time (i.e., weight loss), although the estimated rate was small (-0.0032 kg/day). Model fit was measured using QIC, with lower values indicating better fit but no specific reference ranges. Both models showed very high values (QIC=79,020 for the single-site data and QIC=835,487 for the aggregated data), indicating very poor fit of the model to the data.

Setting an upper temporal boundary at 90 and 32 days from first prescription for the single-site and aggregated data sets, respectively, yielded statistical significance for all variables (time: p<0.0001, exposure: p=0.0266, interaction between these two variables: p<0.0001) and improved the fit (QIC=15,669) for the single-site data set. The final coefficient estimate for time was 0.0104 kg/day (~3.8 kg/year), revealing a positive association between time and weight in patients prescribed prednisone. The aggregated data also showed significance for all variables, and its QIC value improved to 393,087.

We found larger QIC values, indicating poorer fit, for larger upper time limits (60 and 90 days). Setting a delayed lower time limit, so that the data considered for the regression started 7 days after the first prescription, further improved fit (QIC=7,335) while preserving statistical significance for all variables except binary (high/low) exposure (p=0.0645) in the single-site dataset. Adding this lower time limit to the aggregated data set yielded significance for all variables except exposure (p=0.847) and the interaction between exposure and time (p=0.068). However, the estimate for the time variable was negative (-0.050 kg/day), indicating a slight decrease in weight over time. It is important to note that the minimum QIC found for the aggregated data set was much higher than for the single-site data set (QIC=120,133 vs. 7,335), suggesting that the larger, more complex data set may contain much more variability and heterogeneity in variance than this model can account for. In summary, we found weight gain with prednisone in the single-site data, matching the expected outcome, but slight weight loss with prednisone in the aggregated data.

Sensitivity Analysis - To assess the robustness of our findings, we performed a sensitivity analysis using subgroup analyses (Figure 1). The single-site CDW showed a positive relationship between time and weight for all categories; only the age categories above 70 and patients with an initial weight below average did not reach statistical significance. These findings suggest homogeneity within the subgroups of this single-site data set, with an upward trend detectable in multiple subgroups. In contrast, the findings from the aggregated data were inconsistent. Regression analysis provided 11 negative estimated effects (only 5 statistically significant) and 4 positive estimated effects (none statistically significant). The largest effects were found for patients in their 20s (estimated effect=-0.514 kg/day, p=0.338), patients in their 50s (estimated effect=-0.426 kg/day, p=0.261), patients with age below average (estimated effect=-0.292 kg/day, p=0.046), and male patients (estimated effect=-0.320 kg/day, p=0.006). These results suggest that this data set contains heterogeneous sub-populations with very different weight changes over time.

Figure 1. Relationship between weight and follow-up time for sub-populations within the single-site and aggregated data sets. The single-site data set presents similar results for most sub-populations, whereas the aggregated data set returned much more variable results.

We compared data set size and sampling rate (i.e., measurements per patient) for the two data sets (Table 3). Overall, the aggregated data had only a slightly lower sampling rate than the single-site data set (8.02 vs. 9.56 measurements/patient). In contrast, within the optimized time windows presenting the best QIC fit (i.e., 7-90 days and 7-32 days for the single-site and aggregated datasets, respectively), the aggregated data had roughly twice the sampling rate of the single-site data (3.73 vs. 1.84 measurements/patient). However, the aggregated data set contained multiple visit types such as outpatient, inpatient and emergency, whereas the single-site database contained outpatients only. Comparing outpatient data only, we found 9.56 vs. 2.96 measurements/patient for the single-site and aggregated data sets, respectively, and 1.84 vs. 1.27 measurements/patient, respectively, within the selected time windows. In this case, the lower sampling rate, and thus lower statistical power, may partly explain the difference in p-values between the two analyses (p=0.0014 vs. p<0.0001). The lower measurement count per patient could also be explained by the difference in time windows (7-90 days vs. 7-32 days). However, using the 7-90 day time window and outpatients only, the aggregated data set still returned a lower sampling rate than the single-site data set (1.69 vs. 1.84 measurements/patient). Analysis of these data showed weight gain (estimate=0.0114 kg/day) that was not statistically significant (p=0.487).
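
The sampling-rate comparisons above reduce to a simple per-patient count of weight measurements under different filters. A minimal sketch, again with hypothetical column names (visit_type, days, patient_id), is shown below.

```python
def sampling_rate(weights, visit_type=None, lower=None, upper=None):
    """Average number of weight measurements per patient, optionally restricted
    to one visit type and/or a time window in days from first prescription."""
    w = weights
    if visit_type is not None:
        w = w[w["visit_type"] == visit_type]                # e.g., "outpatient"
    if lower is not None and upper is not None:
        w = w[(w["days"] >= lower) & (w["days"] <= upper)]  # e.g., days 7-90
    return len(w) / w["patient_id"].nunique()
```

For example, under these assumed column names, sampling_rate(weights, visit_type="outpatient", lower=7, upper=90) would correspond to the outpatient, 7-90 day comparison above.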

Table 3. Regression analysis results summary for each data set and analytical method with patient counts and total number of measurements. The Finding row summarizes the direction of the estimated weight change and whether it was statistically significant (p<0.05).

Data set | Single-Site: Full Data Set | Single-Site: Days 7-90 (Strongest Signal) | Aggregated: Full Data Set | Aggregated: Days 7-32 (Strongest Signal) | Aggregated: Outpatients Only | Aggregated: Outpatients Days 7-32 | Aggregated: Outpatients Days 7-90
Patient Count | 9,854 | 4,812 | 169,056 | 34,194 | 37,890 | 7,008 | 12,055
Total Number of Weight Measurements | 94,233 | 8,872 | 1,355,508 | 127,550 | 112,098 | 8,874 | 20,370
Average Number of Weight Measurements Per Patient | 9.56 | 1.84 | 8.02 | 3.73 | 2.96 | 1.27 | 1.69
Statistical Estimate (kg/day) | 0 (p=0.900) | 0.0104 (p<0.0001) | -0.0032 (p<.0001) | -0.050 (p<.0001) | 0.0005 (p=0.781) | -0.0280 (p=0.0014) | 0.0114 (p=0.487)
Finding | No change (not significant) | Weight gain (significant) | Weight loss (significant) | Weight loss (significant) | Weight gain (not significant) | Weight loss (significant) | Weight gain (not significant)

Finally, we built a model on the aggregated dataset controlling for the site where the weight measurement was recorded, as is often done for multisite aggregated data. Controlling for site did not change our conclusions regarding the rate of change in weight over time; the estimates were very similar to those of the final model in Table 2 (e.g., -0.0503, p<.0001 vs. -0.050, p<.0001 for the time variable). The dataset contained 146 different sites, of which 35 returned statistically significant estimates for the effect of time on weight, revealing significant differences across sites. We then evaluated potential interactions between exposure and site (i.e., exposure*site terms), finding 63 statistically significant interaction terms out of 273 (23%), indicating that the effect of prednisone exposure depended on the clinical site. We again found very similar estimates (e.g., -0.0498, p<.0001 vs. -0.050, p<.0001 for time) in this model that accounted for interactions, leading to the same weight loss conclusion. These additional results indicate a robust statistical difference across sites and a significant interaction between site and exposure in this particular aggregated database. However, controlling only for the study site did not improve the model, change the relationship between weight and time, or alter the conclusions regarding the effect of prednisone on weight in the larger aggregated dataset.
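
In formula terms, these site-adjusted models simply add the site identifier as a covariate and then as an interaction with exposure. A hedged sketch using the same statsmodels setup as in the Methods sketch (site_id being an assumed column name, not the database's actual field) could look like this:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def site_adjusted_models(df):
    """Fit the base GEE model plus two site-adjusted variants and report the
    `days` (time) coefficient and its p-value for each."""
    base = "weight_kg ~ days * exposure_hi + age_at_rx + C(gender)"
    formulas = {
        "base": base,
        "site_as_covariate": base + " + C(site_id)",
        "site_by_exposure_interaction": base + " + C(site_id) * exposure_hi",
    }
    out = {}
    for name, formula in formulas.items():
        result = smf.gee(
            formula,
            groups="patient_id",
            data=df,
            cov_struct=sm.cov_struct.Autoregressive(),
            family=sm.families.Gaussian(),
        ).fit()
        out[name] = (result.params["days"], result.pvalues["days"])
    return out
```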

Discussion

We found that prednisone was associated with weight loss, weight gain, or no effect depending on the analysis methods, data set, and assumptions. The much larger aggregated data set appeared more heterogeneous, including multiple visit types (e.g., outpatient, inpatient, emergency department), a highly variable sampling rate (i.e., number of weight measurements per patient), multiple sources of data that presented statistically distinct results, and an interaction between exposure and site. This counters the idea that a larger aggregated data set will present a stronger overall signal and yield more robust findings in all circumstances, as we did not consistently find the expected association between prednisone and weight gain using traditional statistical regression methods. Subgroup analyses also yielded more inconsistent results in the aggregated dataset than in the single-site dataset. In summary, the aggregated dataset appeared more heterogeneous and presented more analytical challenges.

Many studies have explored issues related to EHR data quality and bias2,9,37-42. However, the heterogeneous nature of CDWs has only been mentioned as a potential hazard43-48, and few publications have shown or quantified the impact of heterogeneous aggregated clinical data sets on analysis28-30. Data quality is often defined as "fitness for purpose"49-51. However, "fitness for purpose" is difficult to define in the absence of a clear understanding of the data sources constituting a clinical dataset52,53. Without this knowledge, it is difficult to anticipate threats to "fitness for purpose," because the processes that produced the analytical data are unknown and potentially unknowable due to data merging and deidentification. In other words, it is much more difficult to assess whether an aggregated dataset is "fit-for-a-particular-purpose" than a single-site dataset (e.g., a local CDW) where the data production processes are known to the analytical team. This case report illustrates this limitation, and our findings suggest that CDW heterogeneity and data quality may affect analytical outcomes in ways that are not mitigated by using a larger database.

Our results suggest that these data quality issues and potential biases can change analysis outcomes and are not necessarily mitigated by using a larger database. This may be because clinical data reflect workflow, convention, and multiple other factors in addition to "biology"38,54-57. The number of such confounders grows as data are aggregated from multiple care contexts, including institutions, settings (e.g., outpatient vs. inpatient, primary care vs. subspecialty, medical vs. surgical), and other contexts. We found that visit type (inpatient vs. outpatient vs. emergency department) was one of the strongest confounders; it is more intimately related to clinical workflow than to the biology of the studied population. This is consistent with existing work56,57. We also found differences across sites and interactions that must be taken into account for reliable analysis58. This hints at the potential existence of analytical challenges particular to aggregated CDWs (i.e., heterogeneous clinical workflow-derived data collected into a single database) that go beyond the pitfalls expected in analyses of data produced for research purposes (e.g., multi-site study data). These challenges must be further defined, and their effects evaluated, in future research.

Though more data generally mean more statistical power and narrower confidence intervals32,33, it is important to consider potential heterogeneity34 in large multisite aggregated clinical data sets. Because patients from diverse institutions (i.e., diverse clinical workflows), receiving diverse modes of treatment, are included, the data may not be comparable14,27,35-37. This has often been cited in the statistical literature64,65 but can be easily overlooked in applied clinical informatics data reuse. The complexity of data production processes and aggregated databases is a constant threat to appropriate data use66. It may be possible to tease out this complexity by employing reasonable assumptions that rely on clinical and healthcare workflow knowledge33. For example, Hripcsak et al.67 were able to reproduce previous Pneumonia Outcomes Research Team (PORT) studies using EHR data, but only after eliminating the vast majority (up to 90%) of the patients. In addition, their analysis (like ours) also benefited from knowing the expected results.

Understanding the workflow that produced the data being analyzed is often challenging, particularly for multisite data aggregated across institutions and care settings. Incomplete understanding of how a data set was produced is, in general, a hazard to analysis and to the interpretation of analytical results from large clinical data sets66. Clinical workflow may be difficult to define for a particular clinic or patient type even at a single institution. For example, are patients weighed routinely on every visit, or only when there is reason to suspect a change or a specific clinical question? Are prescriptions from all providers, some of whom may be from other institutions, reliably entered into the source EHR? Our findings suggest that this threat grows with clinical data aggregation. They also suggest caution when reusing large aggregated CDWs for secondary analyses and when applying automated "big data" approaches to find previously unrecognized side effects. Also, just as in RCT data analysis, other statistical threats such as Simpson's paradox68,69 and sub-group heterogeneity65,70 must be considered even if the research team has perfect knowledge of how the data were produced.

A strength of our study is that we used two relatively large, robust data sets. The single-institution data included over 750 thousand patients and the multi-site database included over 49.8 million patients, making them among the largest clinical data sets in the US.

Our study has limitations that will be addressed in future work. First, the analytical methods selected were not comprehensive. However, we ran similar regressions on both datasets for the sake of comparability and explored additional regressions on the aggregated dataset that uncovered its complexities. To better understand the reasons for our findings, we chose traditional statistical models rather than neural network-based machine learning approaches (e.g., deep learning methods) that produce "opaque" models71. Though these methods could potentially enable accurate analyses despite poor data quality and heterogeneity72, understanding their role in analysis scalability was beyond the scope of this initial work. Most analysis decisions were driven by the need to replicate a previous analysis33 in which the effect was found in the single-site dataset. This was done to allow for comparable analytical results. For example, we used longitudinal regression models based on GEE methods and categorized the prednisone exposure variable per the initial analysis setup. Other advanced methods and their adaptation to these aggregated datasets will be explored in future work. Second, we chose a particular drug-side effect association. Thus, our results may not apply to other associations, particularly side effects that do not evolve over time. For example, it may be easier to discover events such as myocardial infarctions55 than trends over time. Third, we did not investigate all possible sub-populations and potential covariates (e.g., race and ethnicity) or concurrent clinical events such as other clinical conditions, drugs taken, or surgical procedures. Many additional variables could potentially improve accuracy, some of which may be difficult to assess accurately (e.g., compliance with medications). An alternative approach would be a case-control retrospective study design with propensity scoring to control for potential confounding effects. Although these issues are clearly important, our primary goal was to assess the impact of data set size on the detection of drug side effect associations in large aggregated clinical data sets rather than to build an optimal model for this particular case or to discover new biology.

Conclusion

Analysis of larger clinical databases does not necessarily generate clearer overall signals, particularly if the large data set is aggregated from multiple heterogeneous data sources. Analyses that successfully detected a known association in a single-site dataset were unable to detect the same association in a much larger aggregated data set, despite leveraging significant knowledge about the side effect (e.g., expected timing relative to exposure). Analyses of large, aggregated, anonymized data sets require attention to additional details addressing their heterogeneity, beyond the basic analysis design points typically considered for smaller, single-site clinical datasets.

Acknowledgements

This work was supported in part by the UTHealth Innovation for Cancer Prevention Research Pre-doctoral Fellowship (Cancer Prevention and Research Institute of Texas grant #160015), NIH NCATS grants UL1 TR001420, UL1 TR000371 and UL1 TR001105, NIH NCI grant U01 CA180964, NSF grant III 0964613, the Brown Foundation, Inc., the NIGMS Institutional Research and Academic Career Development Award (IRACDA) program (K12-GM102773), and the Bridges Family Partnership, Ltd. - Sally and Joe Bridges, Jennifer and Todd Darwin, and Beth and Drew Cozby. This work was conducted with data provided by and support from the Cerner Corporation and UTHealth School of Biomedical Informatics. The content is solely the responsibility of the author(s) and does not necessarily represent the official views of the Cerner Corporation, the UTHealth School of Biomedical Informatics, The University of North Carolina at Charlotte, or the National Institutes of Health.


References

  • 1.Safran C. Reuse Of Clinical Data. IMIA Yearb. 2014;9(1):52–4. doi: 10.15265/IY-2014-0013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Weiner MG, Embi PJ. Toward Reuse of Clinical Data for Research and Quality Improvement: The End of the Beginning? Ann Intern Med. 2009 Sep 1;151(5):359–60. doi: 10.7326/0003-4819-151-5-200909010-00141. [DOI] [PubMed] [Google Scholar]
  • 3.Prather JC, Lobach DF, Goodwin LK, Hales JW, Hage ML, Hammond WE. Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Annu Fall Symp. 1997. pp. 101–5. [PMC free article] [PubMed]
  • 4.Blum RL. Discovery, confirmation, and incorporation of causal relationships from a large time-oriented clinical data base: The RX project. Comput Biomed Res. 1982 Apr;15(2):164–87. doi: 10.1016/0010-4809(82)90035-0. [DOI] [PubMed] [Google Scholar]
  • 5.Frawley WJ, Piatetsky-Shapiro G, Matheus CJ. Knowledge discovery in databases: An overview. AI Mag. 1992;13(3):57. [Google Scholar]
  • 6.Safran C. Reuse Of Clinical Data. IMIA Yearb. 2014;9(1):52–54. doi: 10.15265/IY-2014-0013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Safran C. Using routinely collected data for clinical research. Stat Med. 1991 Apr;10(4):559–564. doi: 10.1002/sim.4780100407. [DOI] [PubMed] [Google Scholar]
  • 8.Downing NS, Shah ND, Aminawung JA, Pease AM, Zeitoun J-D, Krumholz HM, et al. Postmarket Safety Events Among Novel Therapeutics Approved by the US Food and Drug Administration Between 2001 and 2010. JAMA. 2017 May 9;317(18):1854–63. doi: 10.1001/jama.2017.5150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PRO, Bernstam EV, et al. Caveats for the Use of Operational Electronic Health Record Data in Comparative Effectiveness Research. Med Care. 2013 Aug;51(8 0 3):S30–7. doi: 10.1097/MLR.0b013e31829b1dbd. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Liaw S-T, Taggart J, Dennis S, Yeo A. Data quality and fitness for purpose of routinely collected data - a general practice case study from an electronic Practice-Based Research Network (ePBRN) AMIA Annu Symp Proc. 2011;2011:785–94. [PMC free article] [PubMed] [Google Scholar]
  • 11.Nobles AL, Vilankar K, Wu H, Barnes LE. Evaluation of data quality of multisite electronic health record data for secondary analysis. 2015 IEEE International Conference on Big Data (Big Data) 2015. pp. 2612–20.
  • 12.O’Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring Diagnoses: ICD Code Accuracy. Health Serv Res. 2005 Oct 1;40(5p2):1620–39. doi: 10.1111/j.1475-6773.2005.00444.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Farzandipour M, Sheikhtaheri A, Sadoughi F. Effective factors on accuracy of principal diagnosis coding based on International Classification of Diseases, the 10th revision (ICD-10) Int J Inf Manag. 2010 Feb 1;30(1):78–84. [Google Scholar]
  • 14.The End of Theory: The Data Deluge Makes the Scientific Method Obsolete [Internet] WIRED. [cited 2014 Sep 24]. Available from: http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory .
  • 15.John Walker S. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Int. J Advert. 2014 Jan;33(1):181–3. [Google Scholar]
  • 16.Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2:3. doi: 10.1186/2047-2501-2-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Belle A, Thiagarajan R, Soroushmehr SMR, Navidi F, Beard DA, Najarian K. Big Data Analytics in Healthcare [Internet] BioMed Research International. 2015. [cited 2019 Feb 7]. Available from: https://www.hindawi.com/journals/bmri/2015/370194/abs/ [DOI] [PMC free article] [PubMed]
  • 18.Sun J, Reddy CK. Big Data Analytics for Healthcare. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [Internet]; New York, NY, USA. ACM; 2013. pp. 1525–1525. [cited 2019 Feb 7]. (KDD ’13). Available from: http://doi.acm.org/10.1145/2487575.2506178 . [Google Scholar]
  • 19.Brown JS, Holmes JH, Shah K, Hall K, Lazarus R, Platt R. Distributed Health Data Networks: A Practical and Preferred Approach to Multi-Institutional Evaluations of Comparative Effectiveness, Safety, and Quality of Care. Med Care. 2010 Jun;48:S45–51. doi: 10.1097/MLR.0b013e3181d9919f. [DOI] [PubMed] [Google Scholar]
  • 20.Holmes JH, Elliott TE, Brown JS, Raebel MA, Davidson A, Nelson AF, et al. Clinical research data warehouse governance for distributed research networks in the USA: a systematic review of the literature. J Am Med Inform Assoc. 2014 Jul 1;21(4):730–6. doi: 10.1136/amiajnl-2013-002370. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Tabachnick BG, Fidell LS. Using multivariate statistics. 5th ed. Boston, MA: Allyn & Bacon/Pearson Education; 2007 xxvii. p. 980. (Using multivariate statistics, 5th ed) [Google Scholar]
  • 22.Hripcsak G, Duke J, Shah N, Reich C, Huser V, Schuemie M, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. MEDINFO. 2015;15 [PMC free article] [PubMed] [Google Scholar]
  • 23.Hripcsak G, Ryan PB, Duke JD, Shah NH, Park RW, Huser V, et al. Characterizing treatment pathways at scale using the OHDSI network. Proc Natl Acad Sci. 2016 Jul 5;113(27):7329–36. doi: 10.1073/pnas.1510502113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhou X, Murugesan S, Bhullar H, Liu Q, Cai B, Wentworth C, et al. An Evaluation of the THIN Database in the OMOP Common Data Model for Active Drug Safety Surveillance. Drug Saf. 2013 Feb 1;36(2):119–34. doi: 10.1007/s40264-012-0009-3. [DOI] [PubMed] [Google Scholar]
  • 25.Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2) J Am Med Inform Assoc. 2010 Mar 1;17(2):124–30. doi: 10.1136/jamia.2009.000893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Natter MD, Quan J, Ortiz DM, Bousvaros A, Ilowite NT, Inman CJ, et al. An i2b2-based, generalizable, open source, self-scaling chronic disease registry. J Am Med Inform Assoc. 2013 Jan 1;20(1):172–9. doi: 10.1136/amiajnl-2012-001042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Segagni D, Tibollo V, Dagliati A, Perinati L, Zambelli A, Priori S, et al. The ONCO-I2b2 project: integrating biobank information and clinical data to support translational research in oncology. Stud Health Technol Inform. 2011;169:887–91. [PubMed] [Google Scholar]
  • 28.Seneviratne MG, Kahn MG, Hernandez-Boussard T. Biocomputing 2019 [Internet] WORLD SCIENTIFIC; 2018. Merging heterogeneous clinical data to enable knowledge discovery; pp. 439–43. [cited 2020 Aug 13]. Available from: https://www.worldscientific.com/doi/abs/10.1142/9789813279827_0040 . [PMC free article] [PubMed] [Google Scholar]
  • 29.Glynn EF, Hoffman MA. Heterogeneity introduced by EHR system implementation in a de-identified data resource from 100 non-affiliated organizations. JAMIA Open. 2019 Dec 1;2(4):554–61. doi: 10.1093/jamiaopen/ooz035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Fu S, Leung LY, Raulli A-O, Kallmes DF, Kinsman KA, Nelson KB, et al. Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction. BMC Med Inform Decis Mak. 2020 Mar 30;20(1):60. doi: 10.1186/s12911-020-1072-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Baker JF, Sauer BC, Cannon GW, Teng C-C, Michaud K, Ibrahim S, et al. Changes in Body Mass Related to the Initiation of Disease-Modifying Therapies in Rheumatoid Arthritis. Arthritis Rheumatol Hoboken NJ. 2016 Aug;68(8):1818–27. doi: 10.1002/art.39647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.WUNG PK, ANDERSON T, FONTAINE KR, HOFFMAN GS, SPECKS U, MERKEL PA, et al. Effects of Glucocorticoids on Weight Change During the Treatment of Wegener’s Granulomatosis. Arthritis Rheum. 2008 May 15;59(5):746–53. doi: 10.1002/art.23561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Diaz-Garelli J-F, Bernstam EV, MSE Rahbar MH, Johnson T. Rediscovering drug side effects: the impact of analytical assumptions on the detection of associations in EHR data. AMIA Summits Transl Sci Proc. 2015 Mar 25;2015:51–5. [PMC free article] [PubMed] [Google Scholar]
  • 34.PredniSONE Tablets [Package Insert] Ridgefield, CT: Boehringer-Ingelheim Inc; 2012. [Google Scholar]
  • 35.Guerrero SC, Sridhar S, Edmonds C, Solis CF, Zhang J, McPherson DD, et al. Access to Routinely Collected Clinical Data for Research: A Process Implemented at an Academic Medical Center. Clin Transl Sci. 2019;12(3):231–5. doi: 10.1111/cts.12614. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Pan W. Akaike’s Information Criterion in Generalized Estimating Equations. Biometrics. 2001 Mar 1;57(1):120–5. doi: 10.1111/j.0006-341x.2001.00120.x. [DOI] [PubMed] [Google Scholar]
  • 37.Botsis T, Hartvigsen G, Chen F, Weng C. Secondary Use of EHR: Data Quality Issues and Informatics Opportunities. AMIA Summits Transl Sci Proc. 2010;2010(1) [PMC free article] [PubMed] [Google Scholar]
  • 38.Hripcsak G, Knirsch C, Zhou L, Wilcox A, Melton GB. Bias Associated with Mining Electronic Health Records. J Biomed Discov Collab. 2011 Jun 6;6:48–52. doi: 10.5210/disco.v6i0.3581. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Rea S, Bailey KR, Pathak J, Haug PJ. Bias in Recording of Body Mass Index Data in the Electronic Health Record. AMIA Summits Transl Sci Proc. 2013;2013:214–8. [PMC free article] [PubMed] [Google Scholar]
  • 40.Rusanov A, Weiskopf NG, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak. 2014 Jun 11;14(1):51. doi: 10.1186/1472-6947-14-51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Diaz-Garelli J-F, Wells BJ, Yelton C, Strowd R, Topaloglu U. Biopsy Records Do Not Reduce Diagnosis Variability in Cancer Patient EHRs: Are We More Uncertain After Knowing? AMIA Jt Summits Transl Sci Proc AMIA Jt Summits. Transl Sci. 2018;2017:72–80. [PMC free article] [PubMed] [Google Scholar]
  • 42.Diaz-Garelli J-F, Strowd R, Wells BJ, Ahmed T, Merrill R, Topaloglu U. Lost in Translation: Diagnosis Records Show More Inaccuracies After Biopsy in Oncology Care EHRs. AMIA Summits Transl Sci Proc. 2019 May 6;2019:325–34. [PMC free article] [PubMed] [Google Scholar]
  • 43.Sittig DF, Hazlehurst BL, Brown J, Murphy S, Rosenman M, Tarczy-Hornoch P, et al. A survey of informatics platforms that enable distributed comparative effectiveness research using multi-institutional heterogeneous clinical data. Med Care. 2012 Jul;50(Suppl):S49–59. doi: 10.1097/MLR.0b013e318259c02b. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Conway M, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, et al. Analyzing the Heterogeneity and Complexity of Electronic Health Record Oriented Phenotyping Algorithms. AMIA Annu Symp Proc. 2011;2011:274–83. [PMC free article] [PubMed] [Google Scholar]
  • 45.Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, et al. The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities. AMIA Annu Symp Proc AMIA Symp AMIA Symp. 2011;2011:248–56. [PMC free article] [PubMed] [Google Scholar]
  • 46.Sun J, Wang F, Hu J, Edabollahi S. Supervised Patient Similarity Measure of Heterogeneous Patient Records
  • 47.Grissom RJ. Heterogeneity of variance in clinical data. [DOI] [PubMed]
  • 48.Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013 Jan 1;20(1):117–21. doi: 10.1136/amiajnl-2012-001145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013 Jan;20(1):144–151. doi: 10.1136/amiajnl-2011-000681. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Holve E, Kahn M, Nahm M, Ryan P, Weiskopf N. A comprehensive framework for data quality assessment in CER. AMIA Summits Transl Sci Proc. 2013;2013:86–8. [PMC free article] [PubMed] [Google Scholar]
  • 51.Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. eGEMs [Internet] 2016 Sep 11;4(1) doi: 10.13063/2327-9214.1244. [cited 2017 Apr 6]. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051581/ [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Diaz-Garelli J-F, Bernstam EV, Lee M, Hwang KO, Rahbar MH, Johnson TR. DataGauge: A Practical Process for Systematically Designing and Implementing Quality Assessments of Repurposed Clinical Data. EGEMs Gener Evid Methods Improve Patient Outcomes. 2019 Jul 25;7(1):32. doi: 10.5334/egems.286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.DiazVasquez J. DataGauge: A Model-Driven Framework for Systematically Assessing the Quality of Clinical Data for Secondary Use. UT SBMI Diss Open Access [Internet] 2016. Aug 16, Available from: http://digitalcommons.library.tmc.edu/uthshis_dissertations/33 .
  • 54.Schneeweiss S, Glynn RJ, Tsai EH, Avorn J, Solomon DH. Adjusting for Unmeasured Confounders in Pharmacoepidemiologic Claims Data Using External Information: The Example of COX2 Inhibitors and Myocardial Infarction. Epidemiology. 2005 Jan;16(1):17–24. doi: 10.1097/01.ede.0000147164.11879.b5. [DOI] [PubMed] [Google Scholar]
  • 55.Brownstein JS, Sordo M, Kohane IS, Mandl KD. The Tell-Tale Heart: Population-Based Surveillance Reveals an Association of Rofecoxib and Celecoxib with Myocardial Infarction. PLoS ONE. 2007 Sep 5;2(9):e840. doi: 10.1371/journal.pone.0000840. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Diaz-Garelli J-F, Strowd R, Ahmed T, Wells BJ, Merrill R, Laurini J, et al. A tale of three subspecialties: Diagnosis recording patterns are internally consistent but Specialty-Dependent. JAMIA Open [Internet] 2019. Aug 5, [cited 2019 Sep 6]; Available from: https://academic.oup.com/jamiaopen/advance-article/doi/10.1093/jamiaopen/ooz020/5543799 . [DOI] [PMC free article] [PubMed]
  • 57.Diaz-Garelli F, Strowd R, Lawson VL, Mayorga ME, Wells BJ, Lycan TW, et al. Workflow Differences Affect Data Accuracy in Oncologic EHRs: A First Step Toward Detangling the Diagnosis Data Babel. JCO Clin Cancer Inform. 2020 Jun 1;4:529–38. doi: 10.1200/CCI.19.00114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Kp V M L. The Effect of Ignoring Statistical Interactions in Regression Analyses Conducted in Epidemiologic Studies: An Example with Survival Analysis Using Cox Proportional Hazards Regression Model. Epidemiol Open Access [Internet] 2016;06(01) doi: 10.4172/2161-1165.1000216. [cited 2016 Oct 24]. Available from: http://www.omicsonline.org/open-access/the-effect-of-ignoring-statistical-interactions-in-regression-analysesconducted-in-epidemiologic-studies-an-example-with-survival-2161-1165-1000216.php?aid=69316 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE. Jr FEH. Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data. PLOS ONE. 2009 Mar 17;4(3):e4922. doi: 10.1371/journal.pone.0004922. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting Sample Size Required for Classification Performance. 2012. [cited 2016 Sep 7]; Available from: http://dl.umsu.ac.ir//handle/Hannan/26109 . [DOI] [PMC free article] [PubMed]
  • 61.Fletcher J. What is heterogeneity and is it important? BMJ. 2007 Jan 11;334(7584):94–6. doi: 10.1136/bmj.39057.406644.68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Hoffman S, Podgurski A. Big Bad Data: Law, Public Health, and Biomedical Databases. J Law Med Ethics. 2013 Mar 1;41:56–60. doi: 10.1111/jlme.12040. [DOI] [PubMed] [Google Scholar]
  • 63.Tatonetti N, Ye P, Daneshjou R, Altman R. Data-Driven Prediction of Drug Effects and Interactions. Sci Transl Med. 2012 Mar 14;4(125):125ra31–125ra31. doi: 10.1126/scitranslmed.3003377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet. 2000 Mar 25;355(9209):1064–9. doi: 10.1016/S0140-6736(00)02039-0. [DOI] [PubMed] [Google Scholar]
  • 65.Gelman A, Auerbach J. Age-aggregation bias in mortality trends. Proc Natl Acad Sci. 2016 Feb 16;113(7):E816–7. doi: 10.1073/pnas.1523465113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Van Der Lei J. Use and abuse of computer-stored medical records. Methods Inf Med. 1991;30(2):79–80. [PubMed] [Google Scholar]
  • 67.Hripcsak G, Knirsch C, Zhou L, Wilcox A, Melton GB. Using discordance to improve classification in narrative clinical databases: An application to community-acquired pneumonia. Comput Biol Med. 2007 Mar;37(3):296–304. doi: 10.1016/j.compbiomed.2006.02.001. [DOI] [PubMed] [Google Scholar]
  • 68.Julious SA, Mullee MA. Confounding and Simpson’s paradox. BMJ. 1994 Dec 3;309(6967):1480–1. doi: 10.1136/bmj.309.6967.1480. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Wagner CH. Simpson’s Paradox in Real Life. Am Stat. 1982 Feb 1;36(1):46–8. [Google Scholar]
  • 70.Gelman A. Commentary: P Values and Statistical Practice. Epidemiology. 2013;24(1):69–72. doi: 10.1097/EDE.0b013e31827886f7. [DOI] [PubMed] [Google Scholar]
  • 71.Tollenaar N, van der Heijden PGM. Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models. J R Stat Soc Ser A Stat Soc. 2013;176(2):565–84. [Google Scholar]
  • 72.Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. Npj Digit Med. 2018 May 8;1(1):1–10. doi: 10.1038/s41746-018-0029-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
