Annals of Family Medicine. 2012 Sep;10(5):473–474. doi: 10.1370/afm.1441

LARGE DATA SETS IN PRIMARY CARE RESEARCH

Jon Meiman, Jeff E Freund
PMCID: PMC3438219  PMID: 22966114

With the widespread adoption of electronic health records (EHRs), researchers have growing access to large data sets that are being used for quality improvement, comparative effectiveness research, and public health policy decision making. In the recent past, large managed care organizations had almost exclusive access to these rich patient data sets. However, EHRs are rapidly leveling the playing field, with academic family medicine programs well positioned to take advantage of this resource and pioneer new fields of study. At the University of Wisconsin Department of Family Medicine (UW-DFM), we recently embarked on a study of polypharmacy that highlights the advantages and challenges of working with large EHR data sets and illustrates both what is possible and what the future may hold.

We began with a simple research question: “What are the patterns and predictors of medication use in our family medicine clinics?”1 Previous studies of polypharmacy have not only been limited to small sample sizes but have also focused primarily on elderly populations. Although insurance claims could provide us with a large, diverse sample, they generally do not include many clinically relevant over-the-counter medications and supplements. In addition, insurance claims do not capture prescription medications purchased without insurance, such as those on discount medication lists. Networked EHRs provide new opportunities for obtaining more comprehensive data regarding health services received, especially among populations who are discontinuously insured.2 Fortunately, UW-DFM has access to an EHR database from a network of 28 ambulatory-care clinics in Wisconsin that together record over 300,000 visits annually. Using anonymized data from this database, we were able to examine the prevalence of polypharmacy across a wide range of variables, including age, body mass index, smoking status, marital status, and major comorbidities. In the end, we analyzed nearly 2 million unique pieces of data from over 110,000 patients, a sample that, to our knowledge, far exceeds that of any previous study of polypharmacy.

Despite ready access to such vast data, our project highlights some of the challenges that face primary care researchers new to working with large EHR data sets. EHR data are gathered for the purposes of health care delivery and, as such, do not adhere to the rigorous standards of scientific studies. Although the sheer volume of data can overcome isolated inaccuracies, large systematic errors can occur. Our data, for example, contained several variables indicative of smoking status that frequently conflicted with one another. This required examining the entire data set for patterns of inconsistency to ensure that our findings were accurate. We also had to exercise caution not just with the available data, but with missing data as well. Missing data are a common issue with EHRs, and simply ignoring these gaps can lead to substantially biased results. We used several advanced statistical techniques to account for the uncertainty created by missing data in order to achieve appropriate confidence intervals. Ultimately, data exploration and cleaning constituted the majority of our efforts and should be a prime focus when analyzing EHR data. Finally, the issue of statistical significance takes on new meaning when working with many thousands of data points. In smaller studies, considerable effort is expended to gather an adequate sample size; in a sufficiently large data set, by contrast, even a trivially small difference can yield a “statistically significant” result. Consequently, large data sets require researchers to transition away from mechanistic statistical tests toward a mathematical modeling approach with the goal of discovering clinically relevant findings.3
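The point about statistical significance can be illustrated with a short, purely hypothetical calculation. The sketch below is not drawn from the study; the difference in mean medication count (0.05), the standard deviation (3.0), the group sizes, and the function names are all assumptions chosen for illustration. It applies a simple two-sample z-test to the same fixed difference at increasing sample sizes, showing that the p-value shrinks while the standardized effect size stays negligible.

```python
import math


def two_sided_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    Assumes two equal-sized groups sharing a common, known standard
    deviation (a simplification made for this illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def cohens_d(mean_diff: float, sd: float) -> float:
    """Standardized effect size; it does not depend on the sample size."""
    return mean_diff / sd


# Hypothetical values, not taken from the study.
MEAN_DIFF = 0.05  # difference in mean number of medications between groups
SD = 3.0          # common standard deviation of medication counts

if __name__ == "__main__":
    for n in (100, 10_000, 100_000, 1_000_000):
        p = two_sided_p(MEAN_DIFF, SD, n)
        d = cohens_d(MEAN_DIFF, SD)
        print(f"n per group = {n:>9,}  p = {p:.3g}  Cohen's d = {d:.3f}")
```

Under these assumptions, the p-value falls below conventional thresholds once the groups are large enough, yet the effect size remains the same small number throughout. That effect size, rather than the p-value, is what speaks to clinical relevance.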

After addressing these challenges, we were able both to arrive at an estimate of polypharmacy for a large, diverse adult population and to identify some of the strongest predictors of heavy medication use. In doing so, we were able to examine segments of the population that had been poorly studied and to control for a wide range of variables. All of this was made possible by the use of a large EHR data set. We believe our exploratory study merely scratches the surface of the research that EHR data sets could ultimately support. Academic family medicine programs are ideally situated to perform influential studies on population health, treatment effectiveness, disease prognosis, and social determinants of health. This research will not only enhance our understanding of disease but also shape how we practice medicine in the future. As a leader in disease management and preventive care, family medicine should capitalize on this new resource and lead the way in large data set research.

References

  1. Freund J, Meiman JG, Kraus C. Medication Use in a Network of Family Medicine Clinics. Poster session presented at: 45th Annual Spring Conference of the Society of Teachers of Family Medicine; Apr 25–29, 2012; Seattle, Washington.
  2. Devoe JE, Gold R, McIntire P, Puro J, Chauvie S, Gallia CA. Electronic health records vs Medicaid claims: completeness of diabetes preventive care data in community health centers. Ann Fam Med. 2011;9(4):351–358.
  3. Rodgers JL. The epistemology of mathematical and statistical modeling: a quiet methodological revolution. Am Psychol. 2010;65(1):1–12.

Articles from Annals of Family Medicine are provided here courtesy of Annals of Family Medicine, Inc.
